Abstract

In many applications involving multi-media data, the definition of similarity between items is
integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation.
Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of
video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key
challenge to be overcome in many real-world applications.

We present a novel multiple kernel learning technique for integrating heterogeneous data into
a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations
which conform to measurements of human perceptual similarity, as expressed by relative
comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media
similarity, we develop graph-based techniques to filter similarity measurements, resulting in
a simplified and robust training procedure.

1. Introduction 

In applications such as content-based recommendation systems, the definition of a proper similarity 
measure between items is crucial to many tasks, including nearest-neighbor retrieval and classifica- 
tion. In some cases, a natural notion of similarity may emerge from domain knowledge, e.g., cosine 
similarity for bag-of-words models of text. However, in more complex, multi-media domains, there 
is often no obvious choice of similarity measure. Rather, viewing different aspects of the data may 
lead to several different, and apparently equally valid notions of similarity. For example, if the 
corpus consists of musical data, each song or artist may be represented simultaneously by acoustic 
features (such as rhythm and timbre), semantic features (tags, lyrics), or social features (collaborative
filtering, artist reviews and biographies, etc.). Although domain knowledge may be employed
to imbue each representation with an intrinsic geometry — and, therefore, a sense of similarity — 
the different notions of similarity may not be mutually consistent. In such cases, there is gener- 
ally no obvious way to combine representations to form a unified similarity space which optimally 
integrates heterogeneous data. 

Without extra information to guide the construction of a similarity measure, the situation seems 
hopeless. However, if some side-information is available, e.g., as provided by human labelers, it can 
be used to formulate a learning algorithm to optimize the similarity measure. 
This idea of using side-information to optimize a similarity function has received a great deal
of attention in recent years. Typically, the notion of similarity is captured by a distance metric over
a vector space (e.g., Euclidean distance in $\mathbb{R}^d$), and the problem of optimizing similarity reduces
to finding a suitable embedding of the data under a specific choice of the distance metric. Metric
learning methods, as they are known in the machine learning literature, can be informed by various
types of side-information, including class labels (Xing et al., 2003; Goldberger et al., 2005;
Globerson and Roweis, 2006; Weinberger et al., 2006), or binary similar/dissimilar pairwise labels
(Wagstaff et al., 2001; Shental et al., 2002; Bilenko et al., 2004; Globerson and Roweis, 2007;
Davis et al., 2007). Alternatively, multidimensional scaling (MDS) techniques are typically formulated
in terms of quantitative (dis)similarity measurements (Torgerson, 1952; Kruskal, 1964;
Cox and Cox, 1994; Borg and Groenen, 2005). In these settings, the representation of data is optimized
so that distance (typically Euclidean) conforms to side-information. Once a suitable metric
has been learned, similarity to new, unseen data can be computed either directly (if the metric
takes a certain parametric form, e.g., a linear projection matrix), or via out-of-sample extensions
(Bengio et al., 2004).

To guide the construction of a similarity space for multi-modal data, we adopt the idea of using
similarity measurements, provided by human labelers, as side-information. However, it has to be
noted that, especially in heterogeneous, multi-media domains, similarity may itself be a highly
subjective concept and vary from one labeler to the next (Ellis et al., 2002). Moreover, a single
labeler may not be able to consistently decide if or to what extent two objects are similar, but she
may still be able to reliably produce a rank-ordering of similarity over pairs (Kendall and Gibbons,
1990). Thus, rather than rely on quantitative similarity or hard binary labels of pairwise similarity,
it is now becoming increasingly common to collect similarity information in the form of triadic or
relative comparisons (Schultz and Joachims, 2004; Agarwal et al., 2007), in which human labelers
answer questions of the form:

"Is x more similar to y or z?"

Although this form of similarity measurement has been observed to be more stable than quantitative
similarity (Kendall and Gibbons, 1990), and clearly provides a richer representation than binary
pairwise similarities, it is still subject to problems of consistency and inter-labeler agreement. It is
therefore imperative that great care be taken to ensure some sense of robustness when working with
perceptual similarity measurements.

In the present work, our goal is to develop a framework for integrating multi-modal data so as 
to optimally conform to perceptual similarity encoded by relative comparisons. In particular, we 
follow three guiding principles in the development of our framework: 

1. The embedding algorithm should be robust against subjectivity and inter-labeler disagree- 
ment. 



2. The algorithm must be able to integrate multi-modal data in an optimal way, i.e., the distances 
between embedded points should conform to perceptual similarity measurements. 

3. It must be possible to compute distances to new, unseen data as it becomes available. 

We formulate this problem of heterogeneous feature integration as a learning problem: given
a data set, and a collection of relative comparisons between pairs, learn a representation of the
data that optimally reproduces the similarity measurements. This type of embedding problem
has been previously studied by Agarwal et al. (2007) and Schultz and Joachims (2004). However,
Agarwal et al. (2007) provide no out-of-sample extension, and neither support heterogeneous feature
integration, nor do they address the problem of noisy similarity measurements.

Figure 1: An overview of our proposed framework for multi-modal feature integration. Data is
represented in multiple feature spaces (each encoded by a kernel function). Humans
supply perceptual similarity measurements in the form of relative pairwise comparisons,
which are in turn filtered by graph processing algorithms, and then used as constraints to
optimize the multiple kernel embedding.

A common approach to optimally integrate heterogeneous data is based on multiple kernel
learning, where each kernel encodes a different modality of the data. Heterogeneous feature integration
via multiple kernel learning has been addressed by previous authors in a variety of contexts,
including classification (Lanckriet et al., 2004; Zien and Ong, 2007; Kloft et al., 2009; Jagarlapudi et al.,
2009), regression (Sonnenburg et al., 2006; Bach, 2008; Cortes et al., 2009), and dimensionality
reduction (Lin et al., 2009). However, none of these methods specifically address the problem of
learning a unified data representation which conforms to perceptual similarity measurements.



1.1 Contributions 



Our contributions in this work are two-fold. First, we develop the partial order embedding (POE)
framework (McFee and Lanckriet, 2009b), which allows us to use graph-theoretic algorithms to
filter a collection of subjective similarity measurements for consistency and redundancy. We then
formulate a novel multiple kernel learning (MKL) algorithm which learns an ensemble of feature
space projections to produce a unified similarity space. Our method is able to produce non-linear
embedding functions which generalize to unseen, out-of-sample data. Figure 1 provides a high-level
overview of the proposed methods.
The remainder of this paper is structured as follows. In Section 2, we develop a graphical
framework for interpreting and manipulating subjective similarity measurements. In Section 3, we
derive an embedding algorithm which learns an optimal transformation of a single feature space.
In Section 4, we develop a novel multiple-kernel learning formulation for embedding problems,
and derive an algorithm to learn an optimal space from heterogeneous data. Section 5 provides
experimental results illustrating the effects of graph-processing on noisy similarity data, and the
effectiveness of the multiple-kernel embedding algorithm on a music similarity task with human
perception measurements. Finally, we prove hardness of dimensionality reduction in this setting in
Section 6, and conclude in Section 7.

1.2 Preliminaries 

A (strict) partial order is a binary relation $R$ over a set $Z$ ($R \subseteq Z^2$) which satisfies the following
properties (see footnote 1):

• Irreflexivity: $(a, a) \notin R$,

• Transitivity: $(a, b) \in R \wedge (b, c) \in R \Rightarrow (a, c) \in R$,

• Anti-symmetry: $(a, b) \in R \Rightarrow (b, a) \notin R$.

Every partial order can be equivalently represented as a directed acyclic graph (DAG), where
each vertex is an element of $Z$ and an edge is drawn from $a$ to $b$ if $(a, b) \in R$. For any partial
order, $R$ may refer to either the set of ordered tuples $\{(a, b)\}$ or the graph (DAG) representation of
the partial order; the use will be clear from context. Let $\mathrm{diam}(R)$ denote the length of the longest
(finite) source-to-sink path in the graph of $R$.
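
The three defining properties and the diameter can be checked mechanically on a small relation. The following is a minimal sketch, assuming the relation is given as a Python set of ordered pairs; the function names `is_strict_partial_order` and `diam` are illustrative and not part of the original text.

```python
def is_strict_partial_order(R):
    """Check irreflexivity, transitivity and anti-symmetry of a set of ordered pairs."""
    R = set(R)
    irreflexive = all(a != b for a, b in R)
    transitive = all((a, d) in R for a, b in R for c, d in R if b == c)
    antisymmetric = all((b, a) not in R for a, b in R)
    return irreflexive and transitive and antisymmetric


def diam(R):
    """Number of edges on the longest path in the graph of R (R must be acyclic)."""
    successors = {}
    for a, b in R:
        successors.setdefault(a, []).append(b)
    longest = {}  # memo: vertex -> length of the longest path starting there

    def depth(v):
        if v not in longest:
            longest[v] = max((1 + depth(w) for w in successors.get(v, [])), default=0)
        return longest[v]

    return max((depth(a) for a, _ in R), default=0)


chain = {(1, 2), (2, 3), (1, 3)}                   # the order 1 < 2 < 3, transitively closed
print(is_strict_partial_order(chain))              # True
print(is_strict_partial_order({(1, 2), (2, 3)}))   # False: (1, 3) is missing
print(diam(chain))                                 # 2, via the path 1 -> 2 -> 3
```

The transitivity test is quadratic in the number of pairs, which is adequate for illustration but not for large relations.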

For a directed graph $G$, we denote by $G^\infty$ its transitive closure, i.e., $G^\infty$ contains an edge
$(i, j)$ if and only if there exists a path from $i$ to $j$ in $G$. Similarly, the transitive reduction (denoted $G^{\min}$)
is the minimal graph with equivalent transitivity to $G$, i.e., the graph with the fewest edges such that

$$ (G^{\min})^\infty = G^\infty. $$

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ denote the training set of $n$ items. A Euclidean embedding is a
function $g : \mathcal{X} \to \mathbb{R}^d$ which maps $\mathcal{X}$ into a $d$-dimensional space equipped with the Euclidean ($\ell_2$)
metric:

$$ \|x - y\|_2 = \sqrt{(x - y)^T (x - y)}. $$

For any matrix $B$, let $B_i$ denote its $i$-th column vector. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ has a
spectral decomposition $A = V \Lambda V^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix
containing the eigenvalues of $A$, and $V$ contains the eigenvectors of $A$. We adopt the convention that
eigenvalues (and corresponding eigenvectors) are sorted in descending order. $A$ is positive
semi-definite (PSD), denoted by $A \succeq 0$, if each eigenvalue is non-negative: $\lambda_i \ge 0$, $i = 1, \ldots, n$.
Finally, a PSD matrix $A$ gives rise to the Mahalanobis distance function

$$ \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}. $$
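
As a concrete illustration of these definitions, the sketch below tests positive semi-definiteness through the eigenvalues and evaluates the Mahalanobis distance for a small example. It assumes NumPy is available; the helper names `is_psd` and `mahalanobis`, and the tolerance `tol`, are choices made here for illustration.

```python
import numpy as np


def is_psd(A, tol=1e-10):
    """True if the symmetric matrix A has no eigenvalue below -tol."""
    eigenvalues = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix, ascending
    return bool(eigenvalues.min() >= -tol)


def mahalanobis(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) induced by a PSD matrix A."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ A @ diff))


A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
x, y = np.array([1.0, 2.0]), np.array([0.0, 0.0])

print(is_psd(A))                      # True
print(mahalanobis(x, y, A))           # 2.0, since 2 * 1^2 + 0.5 * 2^2 = 4
print(mahalanobis(x, y, np.eye(2)))   # with A = I this is the Euclidean distance sqrt(5)
```

Note that `numpy.linalg.eigvalsh` returns eigenvalues in ascending order, the reverse of the descending convention adopted in the text.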



1. The standard definition of a (non-strict) partial order also includes the reflexive property: $\forall a$, $(a, a) \in R$. For reasons
that will become clear in Section 2, we take the strict definition here, and omit the reflexive property.
2. A graphical view of similarity



Before we can construct an embedding algorithm for multi-modal data, we must first establish the 
form of side-information that will drive the algorithm, i.e., the similarity measurements that will be 
collected from human labelers. There is an extensive body of work on the topic of constructing a 
geometric representation of data to fit perceptual similarity measurements. Primarily, this work falls 
under the umbrella of multi-dimensional scaling (MDS), in which perceptual similarity is modeled 
by numerical responses corresponding to the perceived "distance" between a pair of items, e.g., on
a similarity scale of 1-10. (See Cox and Cox (1994); Borg and Groenen (2005) for comprehensive
overviews of MDS techniques.)

Because "distances" supplied by test subjects may not satisfy metric properties (in particular,
they may not correspond to Euclidean distances), alternative non-metric MDS (NMDS) techniques
have been proposed (Kruskal, 1964). Unlike classical or metric MDS techniques, which seek to
preserve quantitative distances, NMDS seeks an embedding in which the rank-ordering of distances
is preserved.

Since NMDS only needs the rank-ordering of distances, and not the distances themselves, the
task of collecting similarity measurements can be simplified by asking test subjects to order pairs of
points by similarity:

"Are $i$ and $j$ more similar than $k$ and $\ell$?"

or, as a special case, the "triadic comparison"

"Is $i$ more similar to $j$ or $\ell$?"

Based on this kind of relative comparison data, the embedding problem can be formulated as follows.
Given is a set of objects $\mathcal{X}$, and a set of similarity measurements $\mathcal{C} = \{(i, j, k, \ell)\} \subseteq \mathcal{X}^4$,
where a tuple $(i, j, k, \ell)$ is interpreted as "$i$ and $j$ are more similar than $k$ and $\ell$." (This formulation
subsumes the triadic comparisons model when $i = k$.) The goal is to find an embedding function
$g : \mathcal{X} \to \mathbb{R}^d$ such that

$$ \forall (i, j, k, \ell) \in \mathcal{C} : \quad \|g(i) - g(j)\|^2 + 1 < \|g(k) - g(\ell)\|^2. \tag{1} $$

The unit margin is forced between the constrained distances for numerical stability.
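
To make constraint (1) concrete, the following sketch lists the comparisons that a given embedding fails to satisfy. It assumes NumPy, items indexed by integers, and an embedding supplied as an array whose $i$-th row is $g(i)$; the function name `violated_comparisons` is hypothetical.

```python
import numpy as np


def violated_comparisons(points, C, margin=1.0):
    """Return the tuples (i, j, k, l) in C that break constraint (1).

    points[i] is the embedded vector g(i); distances are squared Euclidean.
    """
    points = np.asarray(points, dtype=float)
    violated = []
    for (i, j, k, l) in C:
        d_ij = np.sum((points[i] - points[j]) ** 2)
        d_kl = np.sum((points[k] - points[l]) ** 2)
        if not (d_ij + margin < d_kl):
            violated.append((i, j, k, l))
    return violated


# Three items embedded on a line at positions 0, 1 and 3.
points = [[0.0], [1.0], [3.0]]
C = [(0, 1, 0, 2),   # "0 and 1 are more similar than 0 and 2": 1 + 1 < 9 holds
     (0, 2, 0, 1)]   # the reverse comparison: 9 + 1 < 1 fails
print(violated_comparisons(points, C))   # [(0, 2, 0, 1)]
```

The two comparisons in this toy example contradict each other, so no embedding can satisfy both; this is the kind of inconsistency the graph view developed in this section is meant to expose.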



Agarwal et al. (2007) work with this kind of relative comparison data and describe a generalized
NMDS algorithm (GNMDS), which formulates the embedding problem as a semi-definite program.
Schultz and Joachims (2004) derive a similar algorithm which solves a quadratic program to learn
a linear, axis-aligned transformation of data to fit relative comparisons.

Previous work on relative comparison data often treats each measurement $(i, j, k, \ell) \in \mathcal{C}$ as
effectively independent (Schultz and Joachims, 2004; Agarwal et al., 2007). However, due to their
semantic interpretation as encoding pairwise similarity comparisons, and the fact that a pair $(i, j)$
may participate in several comparisons with other pairs, there may be some global structure to $\mathcal{C}$
which these previous methods are unable to exploit.

In Section 2.1, we develop a graphical framework to infer and interpret the global structure
exhibited by the constraints of the embedding problem. Graph-theoretic algorithms presented in
Section 2.2 then exploit this representation to filter this collection of noisy similarity measurements
for consistency and redundancy. The final, reduced set of relative comparison constraints defines a
partial order, making for a more robust and efficient embedding problem.
$\{ (j, k, j, \ell),\ (j, k, i, k),\ (j, \ell, i, k),\ (i, \ell, j, \ell),\ (i, \ell, i, j),\ (i, j, j, \ell) \}$

Figure 2: The graph representation (left) of a set of relative comparisons (right).



2.1 Similarity graphs 

To gain more insight into the underlying structure of a collection of comparisons $\mathcal{C}$, we can represent
$\mathcal{C}$ as a directed graph over $\mathcal{X}^2$. Each vertex in the graph corresponds to a pair $(i, j) \in \mathcal{X}^2$, and
an edge from $(i, j)$ to $(k, \ell)$ corresponds to a similarity measurement $(i, j, k, \ell)$ (see Figure 2).
Interpreting $\mathcal{C}$ as a graph will allow us to infer properties of global (graphical) structure of $\mathcal{C}$. In
particular, two facts become immediately apparent:

1. If $\mathcal{C}$ contains cycles, then there exists no embedding which can satisfy $\mathcal{C}$.

2. If $\mathcal{C}$ is acyclic, any embedding that satisfies the transitive reduction $\mathcal{C}^{\min}$ also satisfies $\mathcal{C}$.

The first fact implies that no algorithm can produce an embedding which satisfies all measurements
if the graph is cyclic. In fact, the converse of this statement is also true: if $\mathcal{C}$ is acyclic, then
an embedding exists in which all similarity measurements are preserved (see Appendix A). If $\mathcal{C}$ is
cyclic, however, by analyzing the graph, it is possible to identify an "unlearnable" subset of $\mathcal{C}$ which
must be violated by any embedding.

Similarly, the second fact exploits the transitive nature of distance comparisons. In the example
depicted in Figure 2, any $g$ that satisfies $(j, k, j, \ell)$ and $(j, \ell, i, k)$ must also satisfy $(j, k, i, k)$. In
effect, the constraint $(j, k, i, k)$ is redundant, and may also be safely omitted from $\mathcal{C}$.
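
The redundancy argument can be reproduced on the comparisons of Figure 2. The sketch below builds the similarity graph and reports every comparison that is implied by a path through other comparisons. It is an illustration under assumptions: the six tuples are transcribed from Figure 2 as best they can be read, pairs are treated as ordered tuples exactly as written, and the function names are hypothetical.

```python
def similarity_graph(C):
    """Adjacency sets over pairs: an edge (i, j) -> (k, l) for each (i, j, k, l) in C."""
    graph = {}
    for (i, j, k, l) in C:
        graph.setdefault((i, j), set()).add((k, l))
        graph.setdefault((k, l), set())   # make sure sink vertices are present too
    return graph


def has_path(graph, source, target, skip_edge=None):
    """Depth-first search for a path source -> target that avoids skip_edge."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if (u, v) == skip_edge or v in seen:
                continue
            if v == target:
                return True
            seen.add(v)
            stack.append(v)
    return False


def redundant_comparisons(C):
    """Comparisons implied by transitivity through the remaining ones."""
    graph = similarity_graph(C)
    return sorted(u + v for u in graph for v in graph[u]
                  if has_path(graph, u, v, skip_edge=(u, v)))


# Comparisons as read from Figure 2 (an assumption of this sketch).
figure2 = [("j", "k", "j", "l"), ("j", "k", "i", "k"), ("j", "l", "i", "k"),
           ("i", "l", "j", "l"), ("i", "l", "i", "j"), ("i", "j", "j", "l")]
print(redundant_comparisons(figure2))
# [('i', 'l', 'j', 'l'), ('j', 'k', 'i', 'k')]
```

The second reported tuple is the constraint $(j, k, i, k)$ discussed above; the first is redundant for the same reason, via the pair $(i, j)$.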

These two observations allude to two desirable properties in C for embedding methods: tran- 
sitivity and anti-symmetry. Together with irreflexivity, these fit the defining characteristics of a 
partial order. Due to subjectivity and inter-labeler disagreement, however, most collections of rel- 
ative comparisons will not define a partial order. Some graph processing, presented next, based on 
an approximate maximum acyclic subgraph algorithm, can reduce them to a partial order. 



2.2 Graph simplification 

Because a set of similarity measurements $\mathcal{C}$ containing cycles cannot be embedded in any Euclidean
space, $\mathcal{C}$ is inherently inconsistent. Cycles in $\mathcal{C}$ therefore constitute a form of label noise. As noted
by Angelova (2004), label noise can have adverse effects on both model complexity and generalization.
This problem can be mitigated by detecting and pruning noisy (confusing) examples, and
training on a reduced, but certifiably "clean" set (Angelova et al., 2005; Vezhnevets and Barinova,
2007).

Unlike most settings, where the noise process affects each label independently (e.g., random
classification noise (Angluin and Laird, 1988)), the graphical structure of interrelated relative
comparisons can be exploited to detect and prune inconsistent measurements. By eliminating similarity
measurements which cannot be realized by any embedding, the optimization procedure can be
carried out more efficiently and reliably on a reduced constraint set.
Ideally, when eliminating edges from the graph, we would like to retain as much information
as possible. Unfortunately, this is equivalent to the maximum acyclic subgraph problem, which is
NP-Complete (Garey and Johnson, 1979). A 1/2-approximate solution can be achieved by a simple
greedy algorithm (Algorithm 1) (Berger and Shor, 1990).



Algorithm 1 Approximate maximum acyclic subgraph
Input: Directed graph $G = (V, E)$
Output: Acyclic graph $G'$

    $E' \leftarrow \emptyset$
    for each $(u, v) \in E$ in random order do
        if $E' \cup \{(u, v)\}$ is acyclic then
            $E' \leftarrow E' \cup \{(u, v)\}$
        end if
    end for
    $G' \leftarrow (V, E')$



Once a consistent subset of similarity measurements has been produced, it can be simplified
further by pruning redundancies. In the graph view of similarity measurements, redundancies can
be easily removed by computing the transitive reduction of the graph (Aho et al., 1972).

By filtering the constraint set for consistency, we ensure that embedding algorithms are not 
learning from spurious information. Additionally, pruning the constraint set by transitive reduc- 
tion focuses embedding algorithms on the most important core set of constraints while reducing 
overhead due to redundant information. 
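
The two filtering steps can be sketched in a few lines of Python. This is a minimal illustration of Algorithm 1 followed by a transitive reduction, not an optimized implementation: acyclicity is tested by a fresh reachability search for every edge, and the names `reachable`, `greedy_acyclic_subgraph`, `transitive_reduction` and the `seed` parameter are assumptions of this sketch.

```python
import random


def reachable(adj, source, target):
    """True if target can be reached from source along edges in adj."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def greedy_acyclic_subgraph(edges, seed=0):
    """Algorithm 1: visit edges in random order, keep each edge that closes no cycle."""
    order = list(edges)
    random.Random(seed).shuffle(order)
    adj, kept = {}, []
    for u, v in order:
        # u -> v closes a cycle exactly when u is already reachable from v
        if not reachable(adj, v, u):
            adj.setdefault(u, set()).add(v)
            kept.append((u, v))
    return kept


def transitive_reduction(edges):
    """For a DAG, drop each edge u -> v that is bypassed by a longer path u -> ... -> v."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    return [(u, v) for u, v in edges
            if not any(reachable(adj, w, v) for w in adj[u] if w != v)]


noisy = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]   # contains the cycle a -> b -> c -> a
consistent = greedy_acyclic_subgraph(noisy)                # an acyclic subset of the edges
core = transitive_reduction(consistent)                    # the same order with fewer edges
```

Which edges survive the first step depends on the random visiting order, so no particular output is shown here; only acyclicity of the result is guaranteed by construction. In the similarity setting, each vertex would be a pair $(i, j)$ and each edge a comparison.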



3. Partial order embedding 

Now that we have developed a language for expressing similarity between items, we are ready to 
formulate the embedding problem. In this section, we develop an algorithm that learns a representation
of data consistent with a collection of relative similarity measurements, and allows unseen
data to be mapped into the learned similarity space after learning. In order to accomplish this, we will
assume a feature representation for X. By parameterizing the embedding function g in terms of 
the feature representation, we will be able to apply g to any point in the feature space, thereby 
generalizing to data outside of the training set. 

3.1 Linear projection 

To start, we assume that the data originally lies in some Euclidean space, i.e., $\mathcal{X} \subseteq \mathbb{R}^D$. There are of
course many ways to define an embedding function $g : \mathbb{R}^D \to \mathbb{R}^d$. Here, we will restrict attention
to embeddings parameterized by a linear projection matrix $M$, so that for a vector $x \in \mathbb{R}^D$,

$$ g(x) = Mx. $$

Collecting the vector representations of the training set as columns of a matrix $X \in \mathbb{R}^{D \times n}$, the
inner product matrix of the embedded points can be characterized as

$$ A = X^T M^T M X. \tag{2} $$
Now, for a relative comparison $(i, j, k, \ell)$, we can express the distance constraint (1) between
embedded points as follows:

$$ (X_i - X_j)^T M^T M (X_i - X_j) + 1 \le (X_k - X_\ell)^T M^T M (X_k - X_\ell). \tag{3} $$

These inequalities can then be used to form the constraint set of an optimization problem to solve
for $M$. Because, in general, $\mathcal{C}$ may not be satisfiable by a linear projection of $\mathcal{X}$, we soften the
constraints by introducing a slack variable $\xi_{ijk\ell} \ge 0$ for each constraint, and minimize the empirical
hinge loss over constraint violations $\frac{1}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell}$. This choice of loss function can be interpreted
as a generalization of ROC area (see Appendix C).

To avoid over-fitting, we introduce a regularization term $\mathrm{tr}(M^T M)$, and a trade-off parameter
$\beta > 0$ to control the balance between regularization and loss minimization. This leads to a
regularized risk minimization objective:

$$ \min_{M,\, \xi \ge 0} \quad \mathrm{tr}(M^T M) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} \tag{4} $$

$$ \text{s.t.} \quad (X_i - X_j)^T M^T M (X_i - X_j) + 1 \le (X_k - X_\ell)^T M^T M (X_k - X_\ell) + \xi_{ijk\ell}, \quad \forall (i, j, k, \ell) \in \mathcal{C}. $$

After learning $M$ by solving this optimization problem, the embedding can be extended to
out-of-sample points $x'$ by applying the projection: $x' \mapsto M x'$.
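
For a fixed projection, the smallest feasible slack for each comparison is the hinge of its margin violation, so the objective can be evaluated directly. The sketch below does only that evaluation (it does not solve the optimization problem); it assumes NumPy, integer item indices into the columns of the data matrix, and a hypothetical function name `lpoe_objective`.

```python
import numpy as np


def lpoe_objective(M, X, C, beta):
    """Value of the regularized objective at M, with each slack at its smallest feasible value."""
    Z = M @ X                                  # embedded training points, one column per item
    slack_total = 0.0
    for (i, j, k, l) in C:
        d_ij = np.sum((Z[:, i] - Z[:, j]) ** 2)
        d_kl = np.sum((Z[:, k] - Z[:, l]) ** 2)
        slack_total += max(0.0, d_ij + 1.0 - d_kl)   # hinge on the margin violation
    return float(np.trace(M.T @ M) + beta / len(C) * slack_total)


X = np.array([[0.0, 1.0, 3.0]])                # D = 1, n = 3
C = [(0, 1, 0, 2), (0, 2, 0, 1)]               # two mutually contradictory comparisons
print(lpoe_objective(np.array([[1.0]]), X, C, beta=2.0))   # 10.0 = 1 + (2 / 2) * 9
print(lpoe_objective(np.array([[0.0]]), X, C, beta=2.0))   # 2.0  = 0 + (2 / 2) * 2
```

With contradictory comparisons the zero projection scores better than the identity here, which illustrates why inconsistent measurements are worth filtering before learning.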

Note that the distance constraints in (4) involve differences of quadratic terms, and are therefore
not convex. However, since $M$ only appears in the form $M^T M$ in (4), we can equivalently express
the optimization problem in terms of a positive semi-definite matrix $W = M^T M$. This change of
variables results in Algorithm 2, a convex optimization problem, more specifically a semi-definite
programming (SDP) problem (Boyd and Vandenberghe, 2004), since objective and constraints are
linear in $W$, including the linear matrix inequality $W \succeq 0$. The corresponding inner product matrix
is

$$ A = X^T W X. $$

Finally, after the optimal $W$ is found, the embedding function $g$ can be recovered from the
spectral decomposition of $W$:

$$ W = V \Lambda V^T \quad \Rightarrow \quad g(x) = \Lambda^{1/2} V^T x. $$
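
The recovery of the projection from $W$ can be sketched with a symmetric eigendecomposition. The example assumes NumPy and uses a small hand-picked PSD matrix in place of a learned one; `projection_from_metric` is a name chosen here for illustration.

```python
import numpy as np


def projection_from_metric(W):
    """Factor a PSD matrix W as M^T M with M = Lambda^{1/2} V^T."""
    eigenvalues, V = np.linalg.eigh(W)              # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]           # re-sort in descending order
    eigenvalues, V = eigenvalues[order], V[:, order]
    eigenvalues = np.clip(eigenvalues, 0.0, None)   # remove tiny negative round-off
    return np.sqrt(eigenvalues)[:, None] * V.T


W = np.array([[2.0, 1.0],
              [1.0, 2.0]])                          # PSD, eigenvalues 3 and 1
M = projection_from_metric(W)
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
diff = x - y

print(np.allclose(M.T @ M, W))                                     # True
print(np.isclose(np.sum((M @ x - M @ y) ** 2), diff @ W @ diff))   # True
```

The second check confirms that squared Euclidean distance after projection equals the squared Mahalanobis distance under $W$, which is what makes the change of variables lossless.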



3.2 Non-linear projection via kernels 

The formulation in Algorithm 2 can be generalized to support non-linear embeddings by the use
of kernels, following the method of Globerson and Roweis (2007): we first map the data into a
reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ via a feature map $\phi$ with corresponding kernel function
$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$; then, the data is mapped to $\mathbb{R}^d$ by a linear projection $M : \mathcal{H} \to \mathbb{R}^d$. The
embedding function $g : \mathcal{X} \to \mathbb{R}^d$ is therefore the composition of the projection $M$ with $\phi$:

$$ g(x) = M(\phi(x)). $$

Because $\phi$ may be non-linear, this allows us to learn a non-linear embedding $g$.
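Algorithm 2 Linear partial order embedding (LPOE)
Input: $n$ objects $\mathcal{X}$,
       partial order $\mathcal{C}$,
       data matrix $X \in \mathbb{R}^{D \times n}$,
       $\beta > 0$
Output: mapping $g : \mathcal{X} \to \mathbb{R}^d$

$$ \min_{W,\, \xi} \quad \mathrm{tr}(W) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} $$

    $d(x_i, x_j) = (X_i - X_j)^T W (X_i - X_j)$
    $d(x_i, x_j) + 1 \le d(x_k, x_\ell) + \xi_{ijk\ell}$
    $\xi_{ijk\ell} \ge 0 \qquad \forall (i, j, k, \ell) \in \mathcal{C}$
    $W \succeq 0$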



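More precisely, we consider $M$ as being comprised of $d$ elements of $\mathcal{H}$, i.e., $\{\omega_1, \omega_2, \ldots, \omega_d\} \subseteq \mathcal{H}$.
The embedding $g$ can thus be expressed as

$$ g(x) = \left( \langle \omega_p, \phi(x) \rangle_{\mathcal{H}} \right)_{p=1}^{d}, $$

where $(\cdot)_{p=1}^{d}$ denotes concatenation over $d$ vectors.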

Note that in general, $\mathcal{H}$ may be infinite-dimensional, so directly optimizing $M$ may not be
feasible. However, by appropriately regularizing $M$, we may invoke the generalized representer
theorem (Schölkopf et al., 2001). Our choice of regularization is the Hilbert-Schmidt norm of $M$,
which, in this case, reduces to

$$ \|M\|_{HS}^2 = \sum_{p=1}^{d} \langle \omega_p, \omega_p \rangle_{\mathcal{H}}. $$

With this choice of regularization, it follows from the generalized representer theorem that at an
optimum, each $\omega_p$ must lie in the span of the training data, i.e.,

$$ \omega_p = \sum_{i=1}^{n} N_{pi} \phi(x_i), \qquad p = 1, \ldots, d, $$

for some real-valued matrix $N \in \mathbb{R}^{d \times n}$. If $\Phi$ is a matrix representation of $\mathcal{X}$ in $\mathcal{H}$ (i.e., $\Phi_i = \phi(x_i)$
for $x_i \in \mathcal{X}$), then the projection operator $M$ can be expressed as

$$ M = N \Phi^T. \tag{5} $$

We can now reformulate the embedding problem as an optimization over $N$ rather than $M$.
Using (5), the regularization term can be expressed as

$$ \|M\|_{HS}^2 = \mathrm{tr}(\Phi N^T N \Phi^T) = \mathrm{tr}(N^T N \Phi^T \Phi) = \mathrm{tr}(N^T N K), $$

where $K$ is the kernel matrix over $\mathcal{X}$:

$$ K = \Phi^T \Phi, \quad \text{with } K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} = k(x_i, x_j). $$

To formulate the distance constraints in terms of $N$, we first express the embedding $g$ in terms of $N$
and the kernel function:

$$ g(x) = M(\phi(x)) = N \Phi^T (\phi(x)) = N \left( \langle \phi(x_i), \phi(x) \rangle_{\mathcal{H}} \right)_{i=1}^{n} = N \left( k(x_i, x) \right)_{i=1}^{n} = N K_x, $$

where $K_x$ is the column vector formed by evaluating the kernel function $k$ at $x$ against the training
set. The inner product matrix of embedded points can therefore be expressed as

$$ A = K N^T N K, $$

which allows us to express the distance constraints in terms of $N$ and the kernel matrix $K$:

$$ (K_i - K_j)^T N^T N (K_i - K_j) + 1 \le (K_k - K_\ell)^T N^T N (K_k - K_\ell). $$

The embedding problem thus amounts to solving the following optimization problem in $N$ and $\xi$:

$$ \min_{N,\, \xi \ge 0} \quad \mathrm{tr}(N^T N K) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} \tag{6} $$

$$ \text{s.t.} \quad (K_i - K_j)^T N^T N (K_i - K_j) + 1 \le (K_k - K_\ell)^T N^T N (K_k - K_\ell) + \xi_{ijk\ell}, \quad \forall (i, j, k, \ell) \in \mathcal{C}. $$

Again, the distance constraints in (6) are non-convex due to the differences of quadratic terms.
And, as in the previous section, $N$ only appears in the form of inner products $N^T N$ in (6), both
in the constraints and in the regularization term, so we can again derive a convex optimization
problem by changing variables to $W = N^T N \succeq 0$. The resulting embedding problem is listed
as Algorithm 3, again a semi-definite programming problem (SDP), with an objective function and
constraints that are linear in $W$.

After solving for $W$, the matrix $N$ can be recovered by computing the spectral decomposition
$W = V \Lambda V^T$, and defining $N = \Lambda^{1/2} V^T$. The resulting embedding function takes the form:

$$ g(x) = \Lambda^{1/2} V^T K_x. $$
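
The kernelized embedding function, including its out-of-sample use, can be sketched as follows. This is an illustration under assumptions: NumPy is used, the Gaussian kernel `rbf` with `gamma=0.5` is an arbitrary choice, and the matrix `W` is a hand-built PSD stand-in for the solution of the optimization problem.

```python
import numpy as np


def rbf(a, b, gamma=0.5):
    """Gaussian kernel; an illustrative choice, not prescribed by the text."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))


def kernel_column(x, train, kernel):
    """K_x: the kernel evaluated between x and every training point."""
    return np.array([kernel(x_i, x) for x_i in train])


def factor(W):
    """N = Lambda^{1/2} V^T, so that N^T N = W."""
    eigenvalues, V = np.linalg.eigh(W)
    return np.sqrt(np.clip(eigenvalues, 0.0, None))[:, None] * V.T


train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = np.array([kernel_column(x, train, rbf) for x in train])   # symmetric kernel matrix
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
W = B.T @ B                                                   # stand-in for a learned W
N = factor(W)


def g(x):
    """Embed any point, seen or unseen, through its kernel column."""
    return N @ kernel_column(x, train, rbf)


Z = np.column_stack([g(x) for x in train])
print(np.allclose(Z.T @ Z, K @ W @ K))     # True: the Gram matrix of embedded points is KWK
print(g(np.array([0.5, 0.5])).shape)       # (3,): an unseen point maps to R^n
```

The ordering of eigenvalues returned by `numpy.linalg.eigh` is ascending, which permutes the output coordinates relative to the descending convention of the text but leaves all distances unchanged.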



As in Schultz and Joachims (2004), this formulation can be interpreted as learning a Mahalanobis
distance metric $\Phi W \Phi^T$ over $\mathcal{H}$. More generally, we can view this as a form of kernel
learning, where the kernel matrix $A$ is restricted to the set

$$ A \in \{ K W K : W \succeq 0 \}. \tag{7} $$



3.3 Connection to GNMDS 

We conclude this section by drawing a connection between Algorithm 3 and the generalized
non-metric MDS (GNMDS) algorithm of Agarwal et al. (2007).
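Algorithm 3 Kernel partial order embedding (KPOE)
Input: $n$ objects $\mathcal{X}$,
       partial order $\mathcal{C}$,
       kernel matrix $K$,
       $\beta > 0$
Output: mapping $g : \mathcal{X} \to \mathbb{R}^n$

$$ \min_{W,\, \xi} \quad \mathrm{tr}(WK) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} $$

    $d(x_i, x_j) = (K_i - K_j)^T W (K_i - K_j)$
    $d(x_i, x_j) + 1 \le d(x_k, x_\ell) + \xi_{ijk\ell}$
    $\xi_{ijk\ell} \ge 0 \qquad \forall (i, j, k, \ell) \in \mathcal{C}$
    $W \succeq 0$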



First, we observe that the $i$-th column, $K_i$, of the kernel matrix $K$ can be expressed in terms of
$K$ and the $i$-th standard basis vector $e_i$:

$$ K_i = K e_i. $$

From this, it follows that distance computations in Algorithm 3 can be equivalently expressed as

$$ d(x_i, x_j) = (K_i - K_j)^T W (K_i - K_j) = (K(e_i - e_j))^T W (K(e_i - e_j)) = (e_i - e_j)^T K^T W K (e_i - e_j). \tag{8} $$

If we consider the extremal case where $K = I$, i.e., we have no prior feature-based knowledge of
similarity between points, then Equation 8 simplifies to

$$ d(x_i, x_j) = (e_i - e_j)^T I W I (e_i - e_j) = W_{ii} + W_{jj} - W_{ij} - W_{ji}. $$

Therefore, in this setting, rather than defining a feature transformation, $W$ directly encodes the inner
products between embedded training points. Similarly, the regularization term becomes

$$ \mathrm{tr}(WK) = \mathrm{tr}(WI) = \mathrm{tr}(W). $$

Minimizing the regularization term can be interpreted as minimizing a convex upper bound on
the rank of $W$ (Boyd and Vandenberghe, 2004), which expresses a preference for low-dimensional
embeddings. Thus, by setting $K = I$ in Algorithm 3, we directly recover the GNMDS algorithm.
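
The reduction for $K = I$ is easy to verify numerically. The sketch below, assuming NumPy and an arbitrary small PSD matrix `W` chosen for illustration, compares the general distance expression of Algorithm 3 with the closed form in terms of entries of $W$.

```python
import numpy as np


def kpoe_distance(K, W, i, j):
    """d(x_i, x_j) = (K_i - K_j)^T W (K_i - K_j), using columns of K."""
    diff = K[:, i] - K[:, j]
    return float(diff @ W @ diff)


W = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])     # symmetric and diagonally dominant, hence PSD
K = np.eye(3)                       # no feature-based prior knowledge

print(kpoe_distance(K, W, 0, 1))                  # 3.0
print(W[0, 0] + W[1, 1] - W[0, 1] - W[1, 0])      # 3.0, the closed form for K = I
print(np.trace(W @ K) == np.trace(W))             # True: the regularizer reduces to tr(W)
```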

Note that directly learning inner products between embedded training data points rather than a
feature transformation does not allow a meaningful out-of-sample extension, to embed unseen data
points. On the other hand, by Equation 7, it is clear that the algorithm optimizes over the entire
cone of PSD matrices. Thus, if $\mathcal{C}$ defines a DAG, we could exploit the fact that a partial order
over distances always allows an embedding which satisfies all constraints in $\mathcal{C}$ (see Appendix A) to
eliminate the slack variables from the program entirely.
4. Multiple kernel embedding

In the previous section, we derived an algorithm to learn an optimal projection from a kernel space
$\mathcal{H}$ to $\mathbb{R}^d$ such that Euclidean distance between embedded points conforms to perceptual similarity.
If, however, the data is heterogeneous in nature, it may not be realistic to assume that a single feature
representation can sufficiently capture the inherent structure in the data. For example, if the objects
in question are images, it may be natural to encode texture information by one set of features, and
color in another, and it is not immediately clear how to reconcile these two disparate sources of
information into a single kernel space.

However, by encoding each source of information independently by separate feature spaces
$\mathcal{H}^1, \mathcal{H}^2, \ldots$ (equivalently, kernel matrices $K^1, K^2, \ldots$), we can formulate a multiple kernel
learning algorithm to optimally combine all feature spaces into a single, unified embedding space.
In this section, we will derive a novel, projection-based approach to multiple-kernel learning and
extend Algorithm 3 to support heterogeneous data in a principled way.



4.1 Unweighted combination 

Let $K^1, K^2, \ldots, K^m$ be a set of kernel matrices, each with a corresponding feature map $\phi^p$ and
RKHS $\mathcal{H}^p$, for $p \in 1, \ldots, m$. One natural way to combine the kernels is to look at the product
space, which is formed by concatenating the feature maps:

$$ \phi(x_i) = \left( \phi^1(x_i), \phi^2(x_i), \ldots, \phi^m(x_i) \right) = \left( \phi^p(x_i) \right)_{p=1}^{m}. $$

Inner products can be computed in this space by summing across each feature map:

$$ \langle \phi(x_i), \phi(x_j) \rangle = \sum_{p=1}^{m} \langle \phi^p(x_i), \phi^p(x_j) \rangle_{\mathcal{H}^p}, $$

resulting in the sum-kernel, also known as the average kernel or product space kernel. The
corresponding kernel matrix can be conveniently represented as the unweighted sum of the base
kernel matrices:

$$ K = \sum_{p=1}^{m} K^p. \tag{9} $$

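Since $K$ is a valid kernel matrix itself, we could use $K$ as input for Algorithm 3. As a result,
the algorithm would learn a kernel from the family

$$ \mathcal{K}_1 = \left\{ \left( \sum_{p=1}^{m} K^p \right) W \left( \sum_{p=1}^{m} K^p \right) = \sum_{p,q=1}^{m} K^p W K^q \;:\; W \succeq 0 \right\}. $$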
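
The equivalence between summing kernel matrices and concatenating feature maps can be checked directly when the feature maps are explicit. The sketch below assumes NumPy and two small, made-up linear feature representations `F1` and `F2` (one column per item); none of these values come from the original text.

```python
import numpy as np

# Two hypothetical modalities for n = 3 items, with explicit (linear) feature maps.
F1 = np.array([[1.0, 0.0, 2.0],
               [0.0, 1.0, 1.0]])          # 2 features per item
F2 = np.array([[3.0, 1.0, 0.0]])          # 1 feature per item

K1, K2 = F1.T @ F1, F2.T @ F2             # base kernel matrices
K_sum = K1 + K2                           # unweighted sum of the base kernels

F_concat = np.vstack([F1, F2])            # concatenated feature map
K_concat = F_concat.T @ F_concat          # inner products in the product space

print(np.allclose(K_sum, K_concat))       # True
```

For kernels without an explicit finite-dimensional feature map the same identity holds in the corresponding RKHS, but it can no longer be verified by stacking arrays.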



4.2 Weighted combination 



Note that $\mathcal{K}_1$ treats each kernel equally; it is therefore impossible to distinguish good features (i.e.,
those which can be transformed to best fit $\mathcal{C}$) from bad features, and as a result, the quality of
the resulting embedding may be degraded. To combat this phenomenon, it is common to learn a
scheme for weighting the kernels in a way which is optimal for a particular task. The most common
approach to combining the base kernels is to take a positive-weighted sum

$$ \sum_{p=1}^{m} \mu_p K^p, $$

where the weights $\mu_p$ are learned in conjunction with a predictor (Lanckriet et al., 2004; Sonnenburg et al.,
2006; Bach, 2008; Cortes et al., 2009). Equivalently, this can be viewed as learning a feature map

$$ \phi(x_i) = \left( \sqrt{\mu_p}\, \phi^p(x_i) \right)_{p=1}^{m}, $$

where each base feature map has been scaled by the corresponding weight $\sqrt{\mu_p}$.

Applying this reasoning to learning an embedding that conforms to perceptual similarity, one
might consider a two-stage approach to parameterizing the embedding (Figure 3(a)): first construct
a weighted kernel combination, and then project from the combined kernel space. Lin et al. (2009)
formulate a dimensionality reduction algorithm in this way. In the present setting, this would be
achieved by simultaneously optimizing $W$ and $\mu_p$ to choose an inner product matrix $A$ from the set

$$\mathcal{K}_2 = \left\{ \sum_{p,q=1}^m \mu_p K^p W \mu_q K^q : W \succeq 0,\ \forall p,\ \mu_p \geq 0 \right\}. \qquad (10)$$



The corresponding distance constraints, however, contain differences of terms cubic in the
optimization variables $W$ and $\mu_p$:

$$\sum_{p,q} (K_i^p - K_j^p)^T \mu_p W \mu_q (K_i^q - K_j^q) + 1 \leq \sum_{p,q} (K_k^p - K_\ell^p)^T \mu_p W \mu_q (K_k^q - K_\ell^q),$$

and are therefore non-convex and difficult to optimize. Even simplifying the class by removing
cross-terms, i.e., restricting $A$ to the form

$$\mathcal{K}_3 = \left\{ \sum_{p=1}^m \mu_p^2 K^p W K^p : W \succeq 0,\ \forall p,\ \mu_p \geq 0 \right\}, \qquad (11)$$

still leads to a non-convex problem, due to the difference of positive quadratic terms introduced by
distance calculations:

$$\sum_{p=1}^m (K_i^p - K_j^p)^T \mu_p^2 W (K_i^p - K_j^p) + 1 \leq \sum_{p=1}^m (K_k^p - K_\ell^p)^T \mu_p^2 W (K_k^p - K_\ell^p).$$

However, a more subtle problem with this formulation lies in the assumption that a single weight 
can characterize the contribution of a kernel to the optimal embedding. In general, different kernels 
may be more or less informative on different subsets of X or different regions of the corresponding 
feature space. Constraining the embedding to a single metric $W$ with a single weight $\mu_p$ for each
kernel may be too restrictive to take advantage of this phenomenon.

4.3 Concatenated projection

We now return to the original intuition behind Equation (9). The sum-kernel represents the inner
product between points in the space formed by concatenating the base feature maps $\phi^p$. The sets $\mathcal{K}_2$
and $\mathcal{K}_3$ characterize projections of the weighted combination space, and turn out to not be amenable
to efficient optimization (Figure 3(a)). This can be seen as a consequence of prematurely combining
kernels prior to projection.

Rather than projecting the (weighted) concatenation of $\phi^p(\cdot)$, we could alternatively concatenate
learned projections $M^p(\phi^p(\cdot))$, as illustrated by Figure 3(b). Intuitively, by defining the embedding
as the concatenation of $m$ different projections, we allow the algorithm to learn an ensemble of
projections, each tailored to its corresponding domain space and jointly optimized to produce an
optimal space. By contrast, the previously discussed formulations apply essentially the same
projection to each (weighted) feature space, and are thus much less flexible than our proposed approach.
Mathematically, an embedding function of this form can be expressed as the concatenation

$$g(x_i) = \left(M^p(\phi^p(x_i))\right)_{p=1}^m.$$

Now, given this characterization of the embedding function, we can adapt Algorithm 3 to
optimize over multiple kernels. As in the single-kernel case, we introduce regularization terms for each
projection operator $M^p$,

$$\sum_{p=1}^m \|M^p\|_{HS}^2,$$

to the objective function. Again, by invoking the representer theorem for each $M^p$, it follows that

$$M^p = N^p (\Phi^p)^T$$

for some matrix $N^p$, which allows us to reformulate the embedding problem as a joint optimization
over $N^p$, $p = 1, \ldots, m$ rather than $M^p$, $p = 1, \ldots, m$. Indeed, the regularization terms can be
expressed as

$$\sum_{p=1}^m \|M^p\|_{HS}^2 = \sum_{p=1}^m \mathrm{tr}\left((N^p)^T (N^p) K^p\right). \qquad (12)$$

The embedding function can now be rewritten as

$$g(x_i) = \left(M^p(\phi^p(x_i))\right)_{p=1}^m = \left(N^p K_i^p\right)_{p=1}^m,$$

and the inner products between embedded points take the form:

$$A_{ij} = \langle g(x_i), g(x_j) \rangle = \sum_{p=1}^m (N^p K_i^p)^T (N^p K_j^p) = \sum_{p=1}^m (K_i^p)^T (N^p)^T (N^p) (K_j^p).$$

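Similarly, squared Euclidean distance also decomposes by kernel:

$$\|g(x_i) - g(x_j)\|^2 = \sum_{p=1}^m (K_i^p - K_j^p)^T (N^p)^T (N^p) (K_i^p - K_j^p). \qquad (13)$$

Kernel class                                                          Learned kernel matrix

$\mathcal{K}_1 = \left\{\sum_{p,q} K^p W K^q\right\}$
    $[K^1 + K^2 + \cdots + K^m]\,[W]\,[K^1 + K^2 + \cdots + K^m]$

$\mathcal{K}_2 = \left\{\sum_{p,q} \mu_p \mu_q K^p W K^q\right\}$
    $\begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}^T \begin{bmatrix} \mu_1^2 W & \mu_1 \mu_2 W & \cdots & \mu_1 \mu_m W \\ \mu_2 \mu_1 W & \mu_2^2 W & \cdots & \vdots \\ \vdots & & \ddots & \\ \mu_m \mu_1 W & \cdots & & \mu_m^2 W \end{bmatrix} \begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}$

$\mathcal{K}_3 = \left\{\sum_{p} \mu_p^2 K^p W K^p\right\}$
    $\begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}^T \begin{bmatrix} \mu_1^2 W & & & \\ & \mu_2^2 W & & \\ & & \ddots & \\ & & & \mu_m^2 W \end{bmatrix} \begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}$

$\mathcal{K}_4 = \left\{\sum_{p} K^p W^p K^p\right\}$
    $\begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}^T \begin{bmatrix} W^1 & & & \\ & W^2 & & \\ & & \ddots & \\ & & & W^m \end{bmatrix} \begin{bmatrix} K^1 \\ K^2 \\ \vdots \\ K^m \end{bmatrix}$

Table 1: Block-matrix formulations of metric learning for multiple-kernel formulations ($\mathcal{K}_1$-$\mathcal{K}_4$).
Each $W^p$ is taken to be positive semi-definite. Note that all sets are equal when there is
only one base kernel.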



Finally, since the matrices $N^p$, $p = 1, \ldots, m$ only appear in the form of inner products in
(12) and (13), we may instead optimize over PSD matrices $W^p = (N^p)^T (N^p)$. This renders the
regularization terms (12) and distances (13) linear in the optimization variables $W^p$. Extending
Algorithm 3 to this parameterization of $g(\cdot)$ therefore results in an SDP, which is listed as Algorithm 4.
To solve the SDP, we implemented a gradient descent solver, which is described in Appendix B.
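The change of variables from $N^p$ to $W^p$ can be illustrated with a small numerical sketch. It assumes randomly generated PSD kernel matrices and random $N^p$; the sizes and the helper names (`embed`, `distance`, `regularizer`) are illustrative and are not part of the solver of Appendix B:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 5, 3, 4  # points, kernels, rows of each N^p (all illustrative)

def random_kernel(n, rng):
    """A random PSD matrix standing in for a base kernel."""
    X = rng.standard_normal((n, n))
    return X @ X.T

K = [random_kernel(n, rng) for _ in range(m)]
N = [rng.standard_normal((d, n)) for _ in range(m)]
W = [Np.T @ Np for Np in N]  # W^p = (N^p)^T N^p, PSD by construction

def embed(i):
    """g(x_i): concatenation of N^p K^p_i over all kernels."""
    return np.concatenate([N[p] @ K[p][:, i] for p in range(m)])

def distance(i, j, W):
    """Equation (13) written in terms of W^p; linear in each W^p."""
    total = 0.0
    for p in range(m):
        diff = K[p][:, i] - K[p][:, j]
        total += diff @ W[p] @ diff
    return total

def regularizer(W):
    """Equation (12) written in terms of W^p; also linear in each W^p."""
    return sum(np.trace(W[p] @ K[p]) for p in range(m))

# Squared distance computed directly from the concatenated embedding.
d_embed = np.sum((embed(0) - embed(3)) ** 2)
```

The value `d_embed` matches `distance(0, 3, W)`, and scaling every $W^p$ by a constant scales both `distance` and `regularizer` by the same constant, which is the linearity that makes the problem an SDP.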

The class of kernels over which Algorithm 4 optimizes can be expressed algebraically as

$$\mathcal{K}_4 = \left\{ \sum_{p=1}^m K^p W^p K^p : \forall p,\ W^p \succeq 0 \right\}. \qquad (14)$$

Note that $\mathcal{K}_4$ contains $\mathcal{K}_3$ as a special case when all $W^p$ are positive scalar multiples of each other.
However, $\mathcal{K}_4$ leads to a convex optimization problem, whereas $\mathcal{K}_3$ does not.
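The containment of $\mathcal{K}_3$ in $\mathcal{K}_4$, and the block-diagonal form listed in Table 1, can be verified on random data. This is a sketch under the assumption of symmetric PSD base kernels; the weights `mu` and all sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3  # points and kernels (illustrative)

def random_psd(n, rng):
    B = rng.standard_normal((n, n))
    return B @ B.T

K = [random_psd(n, rng) for _ in range(m)]
W = random_psd(n, rng)
mu = np.array([0.5, 2.0, 1.5])  # illustrative non-negative weights

# A member of K_3: sum_p mu_p^2 K^p W K^p.
A3 = sum(mu[p] ** 2 * K[p] @ W @ K[p] for p in range(m))

# The same matrix as a member of K_4, with W^p = mu_p^2 W.
W4 = [mu[p] ** 2 * W for p in range(m)]
A4 = sum(K[p] @ W4[p] @ K[p] for p in range(m))

# Block-matrix form: stacked kernels times a block-diagonal metric.
K_stack = np.vstack(K)  # shape (m*n, n)
W_block = np.zeros((m * n, m * n))
for p in range(m):
    W_block[p * n:(p + 1) * n, p * n:(p + 1) * n] = W4[p]
A_block = K_stack.T @ W_block @ K_stack
```

All three matrices `A3`, `A4`, and `A_block` coincide up to round-off.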

Table 1 lists the block-matrix formulations of each of the kernel combination rules described in
this section. It is worth noting that it is certainly valid to first form the unweighted combination
kernel $K$ and then use $\mathcal{K}_1$ (Algorithm 3) to learn an optimal projection of the product space. However,
as we will demonstrate in Section 5, our proposed multiple-kernel formulation ($\mathcal{K}_4$) outperforms
the simple unweighted combination rule in practice.

(b) Concatenated projection ($\mathcal{K}_4$)

Figure 3: Two variants of multiple-kernel embedding. (a) A data point $x \in X$ is mapped into $m$
feature spaces via $\phi^1, \phi^2, \ldots, \phi^m$, which are then scaled by $\mu_1, \mu_2, \ldots, \mu_m$ to form a
weighted feature space $\mathcal{H}^*$, which is subsequently projected to the embedding space via
$M$. (b) $x$ is first mapped into each kernel's feature space and then its image in each space
is directly projected into a Euclidean space via the corresponding projections $M^p$. The
projections are jointly optimized to produce the embedding space.



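Algorithm 4 Multiple kernel partial order embedding (MKPOE)
Input: $n$ objects $X$,
       partial order $C$,
       $m$ kernel matrices $K^1, K^2, \ldots, K^m$,
       $\beta > 0$
Output: mapping $g : X \to \mathbb{R}^{mn}$

$$\min_{W^p, \xi} \ \sum_{p=1}^m \mathrm{tr}(W^p K^p) + \frac{\beta}{|C|} \sum_{C} \xi_{ijk\ell}$$

$$d(x_i, x_j) = \sum_{p=1}^m (K_i^p - K_j^p)^T W^p (K_i^p - K_j^p)$$

$$d(x_i, x_j) + 1 \leq d(x_k, x_\ell) + \xi_{ijk\ell}$$

$$\xi_{ijk\ell} \geq 0 \qquad \forall (i, j, k, \ell) \in C$$

$$W^p \succeq 0 \qquad p = 1, 2, \ldots, m$$

4.4 Diagonal learning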

The MKPOE optimization is formulated as a semi-definite program over $m$ different $n \times n$ matrices
$W^p$ (or, as shown in Table 1, a single $mn \times mn$ PSD matrix with a block-diagonal sparsity
structure). Scaling this approach to large data sets can become problematic, as it requires optimizing
over multiple high-dimensional PSD matrices.

To cope with larger problems, the optimization problem can be refined to constrain each $W^p$
to the set of diagonal matrices. If all $W^p$ are diagonal, positive semi-definiteness is equivalent to
non-negativity of the diagonal values (since they are also the eigenvalues of the matrix). This allows
the constraints $W^p \succeq 0$ to be replaced by linear constraints $W^p_{ii} \geq 0$, and the resulting optimization
problem is a linear program (LP), rather than an SDP. This modification reduces the flexibility of
the model, but leads to a much more efficient optimization procedure.

More specifically, our implementation of Algorithm 4 operates by alternating gradient descent
on $W^p$ and projection onto the feasible set $W^p \succeq 0$ (see Appendix B for details). For full
matrices, this projection is accomplished by computing the spectral decomposition of each $W^p$, and
thresholding the eigenvalues at 0. For diagonal matrices, this projection is accomplished simply by

$$W^p_{ii} \mapsto \max\{0, W^p_{ii}\},$$

which can be computed in $O(mn)$ time, compared to the $O(mn^3)$ time required to compute $m$
spectral decompositions.
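The two projections can be sketched as follows. This is a minimal illustration of the projection step only, not the full solver; the function names are hypothetical, and the diagonal variant assumes that each $W^p$ is stored as a vector of its diagonal entries:

```python
import numpy as np

def project_psd(W):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues at 0."""
    W = (W + W.T) / 2.0  # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(W)
    # Rescale each eigenvector (column) by its clipped eigenvalue, then recombine.
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T

def project_diagonal(w):
    """Project a vector of diagonal entries onto the non-negative orthant."""
    return np.maximum(w, 0.0)

# A symmetric matrix with eigenvalues 3 and -1.
W_full = np.array([[1.0, 2.0],
                   [2.0, 1.0]])
W_full_projected = project_psd(W_full)

# Diagonal entries of a hypothetical diagonal W^p.
w_diag_projected = project_diagonal(np.array([0.3, -0.2]))
```

For the full matrix, the negative eigenvalue is removed and only the component along the eigenvector $(1, 1)/\sqrt{2}$ survives; for the diagonal case, negative entries are simply set to zero.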

Restricting $W^p$ to be diagonal not only simplifies the problem to linear programming, but carries
the added interpretation of weighting the contribution of each (kernel, training point) pair in the
construction of the embedding. A large value at $W^p_{ii}$ corresponds to point $i$ being a landmark for the
features encoded in $K^p$. Note that each of the formulations listed in Table 1 has a corresponding
diagonal variant; however, as in the full matrix case, only $\mathcal{K}_1$ and $\mathcal{K}_4$ lead to convex optimization
problems.



5. Experiments 

To evaluate our framework for learning multi-modal similarity, we first test the multiple kernel 
learning formulation on a simple toy taxonomy data set, and then on a real-world data set of musical 
perceptual similarity measurements. 



5.1 Toy experiment: Taxonomy embedding 

For our first experiment, we generated a toy data set from the Amsterdam Library of Object Images
(ALOI) data set (Geusebroek et al., 2005). ALOI consists of RGB images of 1000 classes of objects
against a black background. Each class corresponds to a single object, and examples are provided
of the object under varying degrees of out-of-plane rotation.

In our experiment, we first selected 10 object classes, and from each class, sampled 20 examples. 
We then constructed an artificial taxonomy over the label set, as depicted in Figure 4. Using the
taxonomy, we synthesized relative comparisons to span subtrees via their least common ancestor.
For example,

(Lemon #1, Lemon #2, Lemon #1, Pear #1),
(Lemon #1, Pear #1, Lemon #1, Sneaker #1),

and so on. These comparisons are consistent and therefore can be represented as a directed acyclic
graph. They are generated so as to avoid redundant, transitive edges in the graph.

Figure 4: The label taxonomy for the experiment in Section 5.1.

For features, we generated five kernel matrices. The first is a simple linear kernel over the 
grayscale intensity values of the images, which, roughly speaking, compares objects by shape. The 
other four are Gaussian kernels over histograms in the (background-subtracted) red, green, blue, and 
intensity channels, and these kernels compare objects based on their color or intensity distributions. 

We augment this set of kernels with five "noise" kernels, each of which was generated by
sampling random points from the unit sphere in $\mathbb{R}^3$ and applying the linear kernel.
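One way to construct such a noise kernel is sketched below. It assumes that "random points" means points drawn uniformly from the sphere, which is obtained here by normalizing standard Gaussian samples; the function name and the seed are illustrative:

```python
import numpy as np

def noise_kernel(n, rng, dim=3):
    """Linear kernel over n points drawn uniformly from the unit sphere in R^dim."""
    X = rng.standard_normal((n, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows to unit length
    return X @ X.T

rng = np.random.default_rng(0)
K_noise = noise_kernel(200, rng)  # 10 classes x 20 examples, as in the text
```

The resulting matrix is PSD with unit diagonal and rank at most 3, and by construction carries no information about the class labels.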

The data was partitioned into five 80/20 training and test set splits. To tune $\beta$, we further
split the training set for 5-fold cross-validation, and swept over $\beta \in \{10^{-2}, 10^{-1}, \ldots, 10^{6}\}$. For
each fold, we learned a diagonally-constrained embedding with Algorithm 4, using the subset of
relative comparisons $(i, j, k, \ell)$ with $i$, $j$, $k$ and $\ell$ restricted to the training set. After learning the
embedding, the held out data (validation or test) was mapped into the space, and the accuracy of the
embedding was determined by counting the fraction of correctly predicted relative comparisons. In
the validation and test sets, comparisons were processed to only include comparisons of the form
$(i, j, i, k)$ where $i$ belongs to the validation (or test) set, and $j$ and $k$ belong to the training set.
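The accuracy measure described above amounts to a simple count. The following sketch assumes that pairwise distances in the embedding space are available as a matrix and that a comparison $(i, j, k, \ell)$ is counted as correct when $d(i, j) < d(k, \ell)$; the toy points and comparisons are invented for illustration:

```python
import numpy as np

def comparison_accuracy(dist, comparisons):
    """Fraction of comparisons (i, j, k, l) for which dist[i, j] < dist[k, l]."""
    correct = sum(bool(dist[i, j] < dist[k, l]) for (i, j, k, l) in comparisons)
    return correct / len(comparisons)

# Toy one-dimensional embedding of four items.
points = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(points[:, None] - points[None, :])

# Two comparisons agree with the embedding; the third contradicts it.
comparisons = [(0, 1, 0, 2), (0, 2, 0, 3), (0, 3, 0, 1)]
accuracy = comparison_accuracy(dist, comparisons)
```

In the experiments, the index $i$ would refer to a held-out point mapped into the learned space, while $j$ and $k$ refer to training points.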

We repeat this experiment for each base kernel individually (i.e., optimizing over $\mathcal{K}_1$ with a
single base kernel), as well as the unweighted sum kernel ($\mathcal{K}_1$ with all base kernels), and finally
MKPOE ($\mathcal{K}_4$ with all base kernels). The results are averaged over all training/test splits, and
collected in Table 2. For comparison purposes, we include the prediction accuracy achieved by
computing distances in each kernel's native space before learning. In each case, the optimized space
indeed achieves higher accuracy than the corresponding native space. (Of course, the random noise
kernels still predict randomly after optimization.)

As illustrated in Table 2(b), taking the unweighted combination of kernels significantly degrades
performance (relative to the best kernel) both in the native space (0.718 accuracy versus 0.862 for
the linear kernel) and in the optimized sum-kernel space, i.e., the unweighted sum kernel optimized
by Algorithm 3 (0.861 accuracy for $\mathcal{K}_1$ versus 0.951 for the linear kernel). However, MKPOE ($\mathcal{K}_4$)
correctly identifies and omits the random noise kernels by assigning them negligible weight, and
achieves higher accuracy (0.984) than any of the single kernels (0.951 for the linear kernel, after
learning).

(b)
            Accuracy
        Native   $\mathcal{K}_1$   $\mathcal{K}_4$
MKL     0.718    0.861   0.984

Table 2: Average test set accuracy for the experiment of Section 5.1. (a) Accuracy is computed
by counting the fraction of correctly predicted relative comparisons in the native space
of each base kernel, and then in the space produced by KPOE ($\mathcal{K}_1$ with a single base
kernel). (b) The unweighted combination of kernels significantly degrades performance,
both in the native space, and the learned space ($\mathcal{K}_1$). MKPOE ($\mathcal{K}_4$) correctly rejects the
random kernels, and significantly outperforms the unweighted combination and the single
best kernel.



5.2 Musical artist similarity 

To test our framework on a real data set, we applied the MKPOE algorithm to the task of learning 
a similarity function between musical artists. The artist similarity problem is motivated by several 
real-world applications, including recommendation and playlist-generation for online radio. Be- 
cause artists may be represented by a wide variety of different features (e.g., tags, acoustic features, 
social data), such applications can benefit greatly from an optimally integrated similarity metric. 



The training data is derived from the aset400 corpus of Ellis et al. (2002), which consists of 412
popular musicians, and 16385 relative comparisons of the form $(i, j, i, k)$. Relative comparisons
were acquired from human test subjects through a web survey; subjects were presented with a
query artist ($i$), and asked to choose what they believe to be the most similar artist ($j$) from a list of
10 candidates. From each single response, 9 relative comparisons are synthesized, indicating that $j$
is more similar to $i$ than the remaining 9 artists ($k$) which were not chosen.
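The synthesis of comparisons from a single survey response can be sketched in a few lines. The function name and the artist identifiers below are hypothetical placeholders, not entries from the aset400 corpus:

```python
def comparisons_from_response(query, chosen, candidates):
    """Turn one survey response into relative comparisons of the form (i, j, i, k)."""
    return [(query, chosen, query, other)
            for other in candidates if other != chosen]

# Hypothetical response: the subject picked candidate "c3" for query "q".
candidates = ["c%d" % t for t in range(10)]
response = comparisons_from_response("q", "c3", candidates)
```

With a list of 10 candidates, each response yields 9 comparisons, one for each artist that was not chosen.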



Our experiments here replicate and extend previous work on this data set (McFee and Lanckriet,
2009a). In the remainder of this section, we will first give an overview of the various types of
features used to characterize each artist in Section 5.2.1. We will then discuss the experimental
procedure in more detail in Section 5.2.2. The MKL embedding results are presented in Section 5.2.3, and
are followed by an experiment detailing the efficacy of our constraint graph processing approach in
Section 5.2.4.

5.2.1 Features

We construct five base kernels over the data, incorporating acoustic, semantic, and social views of 
the artists. 

• MFCC: for each artist, we collected between 1 and 10 songs (mean 4). For each song, we
extracted a short clip consisting of 10000 half-overlapping 23ms windows. For each window, we
computed the first 13 Mel Frequency Cepstral Coefficients (MFCCs) (Davis and Mermelstein),
as well as their first and second instantaneous derivatives. This results in a sequence
of 39-dimensional vectors (delta-MFCCs) for each song. Each artist $i$ was then summarized
by a Gaussian mixture model (GMM) $p_i$ over delta-MFCCs extracted from the corresponding
songs. Each GMM has 8 components and diagonal covariance matrices. Finally, the kernel
between artists $i$ and $j$ is the probability product kernel (Jebara et al., 2004) between their
corresponding delta-MFCC distributions $p_i, p_j$:

$$K^{\text{mfcc}}_{ij} = \int \sqrt{p_i(x)\, p_j(x)}\ dx.$$



• Auto-tags: Using the MFCC features described above, we applied the automatic tagging
algorithm of Turnbull et al. (2008), which for each song yields a multinomial distribution
over a set $T$ of 149 musically-relevant tag words (auto-tags). Artist-level tag distributions $q_i$
were formed by averaging model parameters (i.e., tag probabilities) across all of the songs of
artist $i$. The kernel between artists $i$ and $j$ for auto-tags is a radial basis function applied to
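the $\chi^2$-distance between the multinomial distributions $q_i$ and $q_j$:

$$K^{\text{at}}_{ij} = \exp\left(-\sigma \sum_{t \in T} \frac{(q_i(t) - q_j(t))^2}{q_i(t) + q_j(t)}\right).$$

In these experiments, we fixed $\sigma = 256$.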

• Social tags: For each artist, we collected the top 100 most frequently used tag words from
Last.fm (footnote 2), a social music website which allows users to label songs or artists with arbitrary
tag words or social tags. After stemming and stop-word removal, this results in a vocabulary
of 7737 tag words. Each artist is then represented by a bag-of-words vector in $\mathbb{R}^{7737}$, and
processed by TF-IDF. The kernel between artists for social tags is the cosine similarity (linear
kernel) between TF-IDF vectors.

• Biography: Last.fm also provides textual descriptions of artists in the form of user-contributed
biographies. We collected biographies for each artist in the aset400 data set, and after
stemming and stop-word removal, we arrived at a vocabulary of 16753 biography words. As with
social tags, the kernel between artists is the cosine similarity between TF-IDF bag-of-words
vectors.



• Collaborative filtering: Celma (2008) collected collaborative filtering data from Last.fm in
the form of a bipartite graph over users and artists, where each user is associated with the
artists in her listening history. We filtered this data down to include only the aset400 artists,
of which all but 5 were found in the collaborative filtering graph. The resulting graph has
336527 users and 407 artists, and is equivalently represented by a binary matrix where each
row $i$ corresponds to an artist, and each column $j$ corresponds to a user. The $ij$ entry of this
matrix is 1 if we observe a user-artist association, and 0 otherwise. The kernel between artists
in this view is the cosine of the angle between corresponding rows in the matrix, which can
be interpreted as counting the amount of overlap between the sets of users listening to each
artist and normalizing for overall artist popularity. For the 5 artists not found in the graph, we
fill in the corresponding rows and columns of the kernel matrix with the identity matrix.

2. http://last.fm
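The cosine kernels used for the social tag, biography, and collaborative filtering views share the same structure. The sketch below shows the collaborative filtering case on a tiny invented artist-by-user matrix; treating an all-zero row as an artist missing from the graph, and the function name `cosine_kernel`, are assumptions made for illustration:

```python
import numpy as np

def cosine_kernel(B):
    """Cosine similarity between rows of B; all-zero rows get identity entries."""
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(B, axis=1)
    seen = norms > 0
    K = np.eye(B.shape[0])  # identity rows/columns for unseen artists
    unit = B[seen] / norms[seen, None]
    K[np.ix_(seen, seen)] = unit @ unit.T
    return K

# Three artists, four users; the third artist has no listeners in the graph.
B = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]])
K_cf = cosine_kernel(B)
```

The first two artists share one of the first artist's two listeners, so their similarity is $1/\sqrt{2}$, while the unseen artist is similar only to itself.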

5.2.2 Experimental procedure 

The data set was split into 330 training and 82 test artists. Given the inherent ambiguity in the task 
and the format of the survey, there is a great deal of conflicting information in the survey responses. 
To obtain a more accurate and internally coherent set of training comparisons, directly contradictory
comparisons (e.g., $(i, j, i, k)$ and $(i, k, i, j)$) are removed from the training set, reducing the set from
7915 to 6583 relative comparisons. The training set is further cleaned by finding an acyclic subset 
of comparisons and taking its transitive reduction, resulting in a minimal partial order with 4401 
comparisons. 
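The first cleaning step, removal of directly contradictory comparisons, can be sketched as below. This assumes comparisons are stored as tuples with a consistent ordering of each pair, so that the reverse of $(i, j, k, \ell)$ is exactly $(k, \ell, i, j)$; the function name and the toy data are illustrative:

```python
def prune_direct_contradictions(comparisons):
    """Drop every comparison whose reverse (k, l, i, j) is also present."""
    present = set(comparisons)
    return [(i, j, k, l) for (i, j, k, l) in comparisons
            if (k, l, i, j) not in present]

# The first two comparisons contradict each other; both are removed.
raw = [(0, 1, 0, 2), (0, 2, 0, 1), (0, 1, 0, 3)]
pruned = prune_direct_contradictions(raw)
```

Only the comparison that has no reversed counterpart survives this step.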

To evaluate the performance of an embedding learned from the training data, we apply it to the 
test data, and then measure accuracy by counting the fraction of similarity measurements $(i, j, i, k)$
correctly predicted by distance in the embedding space, where i belongs to the test set, and j and k 
belong to the training set. This setup can be viewed as simulating a query (by-example) i and rank- 
ing the responses j, k from the training set. To gain a more accurate view of the quality of the em- 
bedding, the test set was also pruned to remove directly contradictory measurements. This reduces 
the test set from 2095 to 1753 comparisons. No further processing is applied to test measurements, 
and we note that the test set is not internally consistent, so perfect accuracy is not achievable. 

For each experiment, the optimal $\beta$ is chosen from $\{10^{-2}, 10^{-1}, \ldots, 10^{7}\}$ by 10-fold
cross-validation, i.e., repeating the test procedure above on splits within the training set. Once $\beta$ is chosen,
an embedding is learned with the entire training set, and then evaluated on the test set.

5.2.3 Embedding results 

For each base kernel, we evaluate the test-set performance in the native space (i.e., by distances
calculated directly from the entries of the kernel matrix), and by learned metrics, both diagonal
and full (optimizing over $\mathcal{K}_1$ with a single base kernel). Table 3 lists the results. In all cases, we
observe significant improvements in accuracy over the native space. In all but one case, full-matrix
embeddings significantly outperform diagonally-constrained embeddings.

We then repeated the experiment by examining different groupings of base kernels: acoustic
(MFCC and Auto-tags), semantic (Social tags and Bio), social (Collaborative filter), and
combinations of the groups. The different sets of kernels were combined by Algorithm 4 (optimizing over
$\mathcal{K}_4$). The results are listed in Table 4. For comparison purposes, we also include the unweighted
sum of all base kernels (listed in the Native column).

In all cases, MKPOE improves over the unweighted combination of base kernels. Moreover, 
many combinations outperform the single best kernel (ST), and the algorithm is generally robust 
in the presence of poorly-performing distractor kernels (MFCC and AT). Note that the poor
performance of MFCC and AT kernels may be expected, as they derive from song-level rather than
artist-level features, whereas ST provides high-level semantic descriptions which are generally more
homogeneous across the songs of an artist, and Bio and CF are directly constructed at the artist
level. For comparison purposes, we also trained a metric over all kernels with $\mathcal{K}_1$ (Algorithm 3),
and achieved 0.711 (diagonal) and 0.764 (full): significantly worse than the $\mathcal{K}_4$ results.

                                       Accuracy
Kernel                      Native   $\mathcal{K}_1$ (diagonal)   $\mathcal{K}_1$ (full)

MFCC                        0.464    0.593            0.590
Auto-tags (AT)              0.559    0.568            0.594
Social tags (ST)            0.752    0.773            0.796
Biography (Bio)             0.611    0.629            0.760
Collaborative filter (CF)   0.704    0.655            0.776

Table 3: aset400 embedding results for each of the base kernels. Accuracy is computed in each
kernel's native feature space, as well as the space produced by applying Algorithm 3 (i.e.,
optimizing over $\mathcal{K}_1$ with a single kernel) with either the diagonal or full-matrix
formulation.

                                       Accuracy
Base kernels                Native   $\mathcal{K}_4$ (diagonal)   $\mathcal{K}_4$ (full)

MFCC + AT                   0.521    0.589            0.602
ST + Bio                    0.760    0.786            0.811
MFCC + AT + CF              0.580    0.671            0.719
ST + Bio + CF               0.777    0.782            0.806
MFCC + AT + ST + Bio        0.709    0.788            0.801
All                         0.732    0.779            0.801

Table 4: aset400 embedding results with multiple kernel learning: the learned metrics are optimized
over $\mathcal{K}_4$ by Algorithm 4. Native corresponds to distances calculated according to the
unweighted sum of base kernels.

Figure 5 illustrates the weights learned by Algorithm 4 using all five kernels and
diagonally-constrained $W^p$ matrices. Note that the learned metrics are both sparse (many zero weights) and
non-uniform across different kernels. In particular, the (lowest-performing) MFCC kernel is eliminated
by the algorithm, and the majority of the weight is assigned to the (highest-performing) social tag
(ST) kernel.

A t-SNE (van der Maaten and Hinton, 2008) visualization of the space produced by MKPOE is
illustrated in Figure 6. The embedding captures a great deal of high-level genre structure in low
dimensions: for example, the classic rock and metal genres lie at the opposite end of the space from
pop and hip-hop.

Figure 5: The weighting learned by Algorithm 4 using all five kernels and diagonal $W^p$. Each bar
plot contains the diagonal of the corresponding kernel's learned metric. The horizontal
axis corresponds to the index in the training set, and the vertical axis corresponds to the
learned weight in each kernel space.



5.2.4 Graph processing results 

To evaluate the effects of processing the constraint set for consistency and redundancy, we repeat 
the experiment of the previous section with different levels of processing applied to C. Here, we 
focus on the Biography kernel, since it exhibits the largest gap in performance between the native 
and learned spaces. 

As a baseline, we first consider the full set of similarity measurements as provided by human
judgements, including all inconsistencies. In the 80-20 split, there are 7915 total training
measurements. To first deal with what appear to be the most egregious inconsistencies, we prune all
directly inconsistent training measurements; i.e., whenever $(i, j, i, k)$ and $(i, k, i, j)$ both appear,
both are removed (see footnote 3). This variation results in 6583 training measurements, and while they are not
wholly consistent, the worst violators have been pruned. Finally, we consider the fully processed
case by finding a maximal consistent subset (partial order) of $C$ and removing all redundancies,
resulting in a partial order with 4401 measurements.
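The fully processed case can be illustrated with a simple greedy heuristic. The sketch below is an assumption about one reasonable way to obtain a consistent subset, not a description of the procedure used for the reported numbers: it scans the comparisons in order, keeps an edge between pairs only if it does not close a cycle, and then drops edges that are implied by a longer path (transitive reduction). All names are hypothetical.

```python
from collections import defaultdict

def _reachable(graph, source, target, skip_edge=None):
    """Depth-first search for a path source -> target, optionally ignoring one edge."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if (node, nxt) == skip_edge or nxt in seen:
                continue
            if nxt == target:
                return True
            seen.add(nxt)
            stack.append(nxt)
    return False

def process_comparisons(comparisons):
    """Greedy acyclic subset of the comparison graph, then transitive reduction."""
    graph = defaultdict(set)  # node = pair (i, j); edge u -> v means d(u) < d(v)
    for i, j, k, l in comparisons:
        u, v = (i, j), (k, l)
        # Adding u -> v closes a cycle exactly when v already reaches u.
        if u != v and not _reachable(graph, v, u):
            graph[u].add(v)
    kept = []
    for u in list(graph):
        for v in sorted(graph[u]):
            # u -> v is redundant if v is still reachable without that edge.
            if not _reachable(graph, u, v, skip_edge=(u, v)):
                kept.append(u + v)
    return kept

# The third comparison is implied by the first two; the fourth closes a cycle.
raw = [(0, 1, 0, 2), (0, 2, 0, 3), (0, 1, 0, 3), (0, 3, 0, 1)]
processed = process_comparisons(raw)
```

Because the greedy pass depends on the order in which comparisons are visited, the resulting subset is maximal (no rejected edge can be added back without creating a cycle) but not necessarily of maximum size.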

Using each of these variants of the training set, we test the embedding algorithm with both
diagonal and full-matrix formulations. The results are presented in Table 5. Each level of graph
processing results in a small improvement in the accuracy of the learned space, and provides substantial
reductions in computational overhead at each step of the optimization procedure for Algorithm 3.



3. A more sophisticated approach could be used here, e.g., majority voting, provided there is sufficient over-sampling
of comparisons in the data.

Figure 6: t-SNE visualizations of an embedding of aset400 produced by MKPOE. The embedding
is constructed by optimizing over $\mathcal{K}_4$ with all five base kernels. The two clusters shown
roughly correspond to (a) pop/hip-hop, and (b) classic rock/metal genres. Out-of-sample
points are indicated by a red +.

                 Accuracy
$C$           Diagonal   Full

Full          0.604      0.754
Length-2      0.621      0.756
Processed     0.629      0.760

Table 5: aset400 embedding results (Biography kernel) for three possible refinements of the
constraint set. Full includes all similarity measurements, with no pruning for consistency
or redundancy. Length-2 removes all length-2 cycles (i.e., $(i, j, k, \ell)$ and $(k, \ell, i, j)$).
Processed finds an approximate maximal consistent subset, and removes redundant
constraints.



6. Hardness of dimensionality reduction 

The algorithms given in Sections 3 and 4 attempt to produce low-dimensional solutions by
regularizing $W$, which can be seen as a convex approximation to the rank of the embedding. In general,
because rank constraints are not convex, convex optimization techniques cannot efficiently
minimize dimensionality. This does not necessarily imply that other techniques could not work, so it is
natural to ask if exact solutions of minimal dimensionality can be found efficiently, particularly in
the multidimensional scaling scenario, i.e., when $K = I$ (Section 3.3).

As a special case, one may wonder if any instance $(X, C)$ can be satisfied in $\mathbb{R}^1$. As Figure 7
demonstrates, not all instances can be realized in one dimension. Even more, we show that it is
NP-Complete to decide if a given $C$ can be satisfied in $\mathbb{R}^1$. Given an embedding, it can be verified in
polynomial time whether $C$ is satisfied or not by simply computing the distances between all pairs
and checking each comparison in $C$, so the decision problem is in NP. It remains to show that the
$\mathbb{R}^1$ partial order embedding problem (hereafter referred to as 1-POE) is NP-Hard. We reduce from
the Betweenness problem (Opatrny, 1979), which is known to be NP-Complete.



Definition 6.1 (Betweenness) Given a finite set $Z$ and a collection $T$ of ordered triples $(a, b, c)$ of
distinct elements from $Z$, is there a one-to-one function $f : Z \to \mathbb{R}$ such that for each $(a, b, c) \in T$,
either $f(a) < f(b) < f(c)$ or $f(c) < f(b) < f(a)$?



Theorem 1 1-POE is NP-Hard. 



Proof Let $(Z, T)$ be an instance of Betweenness. Let $X = Z$, and for each $(a, b, c) \in T$,
introduce constraints $(a, b, a, c)$ and $(b, c, a, c)$ to $C$. Since Euclidean distance in $\mathbb{R}^1$ is simply line
distance, these constraints force $g(b)$ to lie between $g(a)$ and $g(c)$. Therefore, the original instance
$(Z, T) \in$ Betweenness if and only if the new instance $(X, C) \in$ 1-POE. Since Betweenness is
NP-Hard, 1-POE is NP-Hard as well. ■
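The reduction in the proof is mechanical and can be sketched directly. The code below assumes a one-dimensional embedding is given as a dictionary from elements to real numbers, and checks strict distance inequalities (a unit margin can always be obtained afterwards by rescaling); the function names and the toy instance are illustrative:

```python
def betweenness_to_poe(triples):
    """Map Betweenness triples (a, b, c) to 1-POE constraints (i, j, k, l)."""
    constraints = []
    for a, b, c in triples:
        constraints.append((a, b, a, c))  # d(a, b) < d(a, c)
        constraints.append((b, c, a, c))  # d(b, c) < d(a, c)
    return constraints

def satisfies(g, constraints):
    """Check d(i, j) < d(k, l) for every constraint under a 1-D embedding g."""
    return all(abs(g[i] - g[j]) < abs(g[k] - g[l])
               for i, j, k, l in constraints)

triples = [("a", "b", "c"), ("b", "c", "d")]
constraints = betweenness_to_poe(triples)

good = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}  # b between a and c; c between b and d
bad = {"a": 0.0, "b": 3.0, "c": 1.0, "d": 2.0}   # b is not between a and c
```

The ordering `good` satisfies every generated constraint, whereas `bad` violates the first one, since placing $b$ outside the interval between $a$ and $c$ makes $d(a, b)$ exceed $d(a, c)$.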



Since 1-POE can be reduced to the general optimization problem of finding an embedding of
minimal dimensionality, we can conclude that dimensionality reduction subject to partial order
constraints is also NP-Hard.

Figure 7: (a) The vertices of a square in $\mathbb{R}^2$. (b) The partial order over distances induced by the
square: each side is less than each diagonal. This constraint set cannot be satisfied in $\mathbb{R}^1$.



7. Conclusion 

We have demonstrated a novel method for optimally integrating heterogeneous data to conform to 
measurements of perceptual similarity. By interpreting a collection of relative similarity compar- 
isons as a directed graph over pairs, we are able to apply graph-theoretic techniques to isolate and 
prune inconsistencies in the training set and reduce computational overhead by eliminating redun- 
dant constraints in the optimization procedure. 

Our multiple-kernel formulation offers a principled way to integrate multiple feature modalities 
into a unified similarity space. Our formulation carries the intuitive geometric interpretation of con- 
catenated projections, and results in a semidefinite program. By incorporating diagonal constraints 
as well, we are able to reduce the computational complexity of the algorithm, and learn a model 
which is both flexible — only using kernels in the portions of the space where they are informative 
— and interpretable — each diagonal weight corresponds to the contribution to the optimized space 
due to a single point within a single feature space. Table 1 provides a unified perspective of multiple
kernel learning formulations for embedding problems, but it is clearly not complete. It will be
the subject of future work to explore and compare alternative generalizations and restrictions of the 
formulations presented here. 



Appendix A. Embeddability of partial orders 

In this appendix, we prove that any set $X$ with a partial order over distances $C$ can be embedded
into $\mathbb{R}^n$ while satisfying all distance comparisons.

In the special case where $C$ is a total ordering over all pairs (i.e., a chain graph), the problem
reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying
embedding can always be found by the constant-shift embedding algorithm of Roth et al. (2003). In
general, $C$ is not a total order, but a $C$-respecting embedding can always be produced by reducing
the partial order to a (weak) total order by topologically sorting the graph (see Algorithm 5).

Algorithm 5 Naive total order construction
Input: objects $X$, partial order $C$
Output: symmetric dissimilarity matrix $\Delta \in \mathbb{R}^{n \times n}$

for each $i$ in $1 \ldots n$ do
    $\Delta_{ii} \leftarrow 0$
end for
for each $(k, \ell)$ in topological order do
    if in-degree$(k, \ell) = 0$ then
        $\Delta_{k\ell}, \Delta_{\ell k} \leftarrow 1$
    else
        $\Delta_{k\ell}, \Delta_{\ell k} \leftarrow \max_{(i,j,k,\ell) \in C} \Delta_{ij} + 1$
    end if
end for

Let $\Delta$ be the dissimilarity matrix produced by Algorithm 5 on an instance $(X, C)$. An
embedding can be found by first applying classical multidimensional scaling (MDS) (Cox and Cox, 1994)
to $\Delta$:

\[ A = -\frac{1}{2} H \Delta H, \tag{15} \]



where $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\mathsf{T}$ is the $n \times n$ centering matrix, and
$\mathbf{1}$ is a vector of 1s. Shifting the spectrum of $A$ yields

\[ A - \lambda_n(A) I = \hat{A} \succeq 0, \tag{16} \]

where $\lambda_n(A)$ is the minimum eigenvalue of $A$. The embedding $g$ can be found by decomposing
$\hat{A} = V \hat{\Lambda} V^\mathsf{T}$, so that $g(x_i)$ is the $i$-th column of
$\hat{\Lambda}^{1/2} V^\mathsf{T}$; this is the solution constructed by the constant-shift embedding
non-metric MDS algorithm of Roth et al. (2003).

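Applying this transformation to $A$ affects distances by

\[
\begin{aligned}
\|g(x_i) - g(x_j)\|^2 &= \hat{A}_{ii} + \hat{A}_{jj} - 2\hat{A}_{ij} \\
&= (A_{ii} - \lambda_n) + (A_{jj} - \lambda_n) - 2A_{ij} \\
&= A_{ii} + A_{jj} - 2A_{ij} - 2\lambda_n .
\end{aligned}
\]

For $i \neq j$, the terms $A_{ii} + A_{jj} - 2A_{ij}$ recover the original dissimilarity between $x_i$
and $x_j$, because double centering leaves the differences $e_i - e_j$ unchanged. Since adding a
constant ($-2\lambda_n$) preserves the ordering of distances, the total order (and hence $C$) is
preserved by this transformation. Thus, for any instance $(X, C)$, an embedding can be found in
$\mathbb{R}^{n-1}$.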
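
The construction in Equations 15 and 16 can be checked numerically. The sketch below assumes NumPy and uses a small hand-written dissimilarity matrix for four objects; the function name `constant_shift_embedding` and the variable names are hypothetical. It is meant as an illustration of the argument above, not as the solver used in the paper.

```python
import numpy as np


def constant_shift_embedding(delta):
    """Embed points so that squared distances equal delta_ij plus a constant."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    A = -0.5 * H @ delta @ H                # Equation 15
    lam_min = np.linalg.eigvalsh(A)[0]      # eigvalsh sorts ascending
    A_hat = A - lam_min * np.eye(n)         # Equation 16
    evals, V = np.linalg.eigh(A_hat)
    evals = np.clip(evals, 0.0, None)       # remove round-off negatives
    G = np.sqrt(evals)[:, None] * V.T       # column i is g(x_i)
    return G, lam_min


# Symmetric dissimilarities with zero diagonal, encoding the weak order
# d(0,1) < d(0,2) < d(2,3), with all remaining pairs tied at 1.
delta = np.array([[0, 1, 2, 1],
                  [1, 0, 1, 1],
                  [2, 1, 0, 3],
                  [1, 1, 3, 0]], dtype=float)

G, lam_min = constant_shift_embedding(delta)
diff = G[:, :, None] - G[:, None, :]        # diff[:, i, j] = g(x_i) - g(x_j)
sq_dist = (diff ** 2).sum(axis=0)

# Off the diagonal, every squared distance is delta_ij - 2 * lam_min.
off_diagonal = ~np.eye(4, dtype=bool)
shift_holds = np.allclose(sq_dist[off_diagonal],
                          delta[off_diagonal] - 2.0 * lam_min)
```

Because every off-diagonal entry is shifted by the same amount, any strict comparison between two pairs in `delta` carries over to `sq_dist`.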



Appendix B. Solver 

Our implementation of Algorithm 4 is based on a simple projected (sub)gradient descent. To
simplify exposition, we show the derivation of the single-kernel SDP version of the algorithm
(Algorithm 3) with unit margins. (It is straightforward to extend the derivation to the
multiple-kernel and LP settings.)

We first observe that a kernel matrix column $K_i$ can be expressed as $K^\mathsf{T} e_i$, where
$e_i$ is the $i$-th standard basis vector. We can then denote the distance calculations in terms of
Frobenius inner products:

\[
\begin{aligned}
d(x_i, x_j) &= (K_i - K_j)^\mathsf{T} W (K_i - K_j) \\
&= (e_i - e_j)^\mathsf{T} K W K (e_i - e_j) \\
&= \operatorname{tr}\left(K W K (e_i - e_j)(e_i - e_j)^\mathsf{T}\right) = \operatorname{tr}(W K E_{ij} K) \\
&= \langle W, K E_{ij} K \rangle_F ,
\end{aligned}
\]

where $E_{ij} = (e_i - e_j)(e_i - e_j)^\mathsf{T}$.
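
The chain of equalities above can be verified numerically. The following sketch assumes NumPy, a linear kernel on random data, and an arbitrary symmetric PSD matrix `W`; the helper names `E`, `dist_columns` and `dist_frobenius` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))
K = X @ X.T                        # linear kernel: symmetric, PSD
B = rng.normal(size=(n, n))
W = B @ B.T                        # arbitrary symmetric PSD matrix


def E(i, j, n):
    """E_ij = (e_i - e_j)(e_i - e_j)^T."""
    v = np.zeros(n)
    v[i] += 1.0
    v[j] -= 1.0
    return np.outer(v, v)


def dist_columns(K, W, i, j):
    """First line of the derivation: (K_i - K_j)^T W (K_i - K_j)."""
    diff = K[:, i] - K[:, j]
    return diff @ W @ diff


def dist_frobenius(K, W, i, j):
    """Last line of the derivation: <W, K E_ij K>_F."""
    return np.sum(W * (K @ E(i, j, K.shape[0]) @ K))


lhs = dist_columns(K, W, 1, 3)
rhs = dist_frobenius(K, W, 1, 3)   # agrees with lhs up to round-off
```

The Frobenius form is the useful one for optimization, since it makes the distance linear in `W`.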

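A margin constraint $(i, j, k, \ell)$ can now be expressed as:

\[
\begin{aligned}
& d(x_i, x_j) + 1 \le d(x_k, x_\ell) + \xi_{ijk\ell} \\
\Rightarrow\ & \langle W, K E_{ij} K \rangle_F + 1 \le \langle W, K E_{k\ell} K \rangle_F + \xi_{ijk\ell} \\
\Rightarrow\ & \xi_{ijk\ell} \ge 1 + \langle W, K (E_{ij} - E_{k\ell}) K \rangle_F .
\end{aligned}
\]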

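The slack variables $\xi_{ijk\ell}$ can be eliminated from the program by rewriting the objective in
terms of the hinge loss $h(\cdot)$ over the constraints:

\[
\min_{W \succeq 0} f(W) \quad \text{where} \quad
f(W) = \operatorname{tr}(WK) + \frac{\beta}{|C|} \sum_{C} h\left(1 + \langle W, K (E_{ij} - E_{k\ell}) K \rangle_F\right).
\]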

The gradient $\nabla f$ has two components: one due to regularization, and one due to the hinge loss.
The gradient due to regularization is simply $K$. The loss term decomposes linearly, and for each
$(i, j, k, \ell) \in C$, a subgradient direction can be defined:

\[
\frac{\partial}{\partial W} h\left(1 + d(x_i, x_j) - d(x_k, x_\ell)\right) =
\begin{cases}
0 & d(x_i, x_j) + 1 \le d(x_k, x_\ell) \\
K (E_{ij} - E_{k\ell}) K & \text{otherwise.}
\end{cases}
\tag{17}
\]

Rather than computing each gradient direction independently, we observe that each violated
constraint contributes a matrix of the form $K (E_{ij} - E_{k\ell}) K$. By linearity, we can collect
all $(E_{ij} - E_{k\ell})$ terms and then pre- and post-multiply by $K$ to obtain a more efficient
calculation of $\nabla f$:

\[
\nabla f = K + \frac{\beta}{|C|} K \left( \sum_{(i,j,k,\ell) \in C'} E_{ij} - E_{k\ell} \right) K ,
\]

where $C' \subseteq C$ is the set of all currently violated constraints.
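
A minimal sketch of this batched subgradient computation is given below, assuming NumPy, a symmetric `W`, and constraints encoded as integer tuples `(i, j, k, l)`. The function name and the parameter `beta` (standing in for the weight on the hinge term) are assumptions made for the example, not part of the original implementation.

```python
import numpy as np


def objective_and_subgradient(W, K, constraints, beta=1.0):
    """Hinge-loss objective f(W) and one subgradient, for symmetric W."""
    n = K.shape[0]
    KWK = K @ W @ K

    def dist(i, j):
        # <W, K E_ij K>_F written out entrywise.
        return KWK[i, i] + KWK[j, j] - 2.0 * KWK[i, j]

    loss = 0.0
    E_sum = np.zeros((n, n))   # accumulates E_ij - E_kl over violations
    for i, j, k, l in constraints:
        violation = 1.0 + dist(i, j) - dist(k, l)
        if violation > 0:
            loss += violation
            for a, b, sign in ((i, j, 1.0), (k, l, -1.0)):
                E_sum[a, a] += sign
                E_sum[b, b] += sign
                E_sum[a, b] -= sign
                E_sum[b, a] -= sign

    scale = beta / len(constraints)
    f = np.trace(W @ K) + scale * loss
    grad = K + scale * (K @ E_sum @ K)   # a single multiplication by K on each side
    return f, grad
```

Accumulating the sparse `E_sum` first means the two dense products with `K` are paid once per step rather than once per violated constraint.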

After each gradient step $W \mapsto W - \alpha \nabla f$, the updated $W$ is projected back onto the
set of positive semidefinite matrices by computing its spectral decomposition and thresholding the
eigenvalues by $\lambda_i \mapsto \max(0, \lambda_i)$.
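
The projection step can be sketched as follows, assuming NumPy. The function names `project_psd` and `projected_step` are hypothetical, and the explicit symmetrization is an added safeguard against round-off rather than something stated in the text.

```python
import numpy as np


def project_psd(W):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    W = 0.5 * (W + W.T)                      # guard against round-off asymmetry
    evals, V = np.linalg.eigh(W)
    return (V * np.maximum(evals, 0.0)) @ V.T


def projected_step(W, grad, alpha):
    """One projected (sub)gradient step: W -> proj(W - alpha * grad)."""
    return project_psd(W - alpha * grad)


W = np.array([[1.0, 0.0],
              [0.0, -2.0]])
W_psd = project_psd(W)   # the negative eigenvalue is replaced by zero
```

Clipping eigenvalues in this way gives the nearest PSD matrix in Frobenius norm, which is why it serves as the projection.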

To extend this derivation to the multiple-kernel case (Algorithm 4), we can define

\[ d(x_i, x_j) = \sum_{p=1}^{m} d_p(x_i, x_j), \]

and exploit linearity to compute each partial derivative $\partial / \partial W^p$ independently.

For the diagonally-constrained case, it suffices to substitute

\[ K (E_{ij} - E_{k\ell}) K \mapsto \operatorname{diag}\left(K (E_{ij} - E_{k\ell}) K\right) \]

in Equation 17. After each gradient step in the diagonal case, the PSD constraint on $W$ can be
enforced by the projection $W_{ii} \mapsto \max(0, W_{ii})$.
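
In the diagonal case, the substituted direction can be computed from four kernel columns without forming any $n \times n$ product, since entry $t$ of $\operatorname{diag}(K E_{ij} K)$ is $(K_{ti} - K_{tj})^2$ for symmetric $K$. The sketch below assumes NumPy and stores the diagonal of $W$ as a vector `w`; both function names are hypothetical.

```python
import numpy as np


def diag_direction(K, i, j, k, l):
    """diag(K (E_ij - E_kl) K), computed from four kernel columns."""
    return (K[:, i] - K[:, j]) ** 2 - (K[:, k] - K[:, l]) ** 2


def project_nonnegative(w):
    """Projection for diagonal W: clip each diagonal weight at zero."""
    return np.maximum(w, 0.0)


rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2))
K = X @ X.T                                  # symmetric kernel matrix
direction = diag_direction(K, 0, 1, 2, 3)    # one entry per training point
```

This is the source of the computational savings in the diagonal case: each step needs only elementwise operations and no spectral decomposition.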

Appendix C. Relationship to AUC

In this appendix, we formalize the connection between partial orders over distances and
query-by-example ranking. Recall that Algorithm 2 minimizes the loss
$\frac{1}{|C|} \sum_{C} \xi_{ijk\ell}$, where each $\xi_{ijk\ell} \ge 0$ is a slack variable
associated with a margin constraint

\[ d(i, j) + 1 \le d(k, \ell) + \xi_{ijk\ell} . \]



As noted by Schultz and Joachims (2004), the fraction of relative comparisons satisfied by an
embedding $g$ is closely related to the area under the receiver operating characteristic curve (AUC).
To make this connection precise, consider the following information retrieval problem. For each
point $x_i \in X$, we are given a partition of $X \setminus \{x_i\}$:

\[
\begin{aligned}
X_i^+ &= \{x_j : x_j \in X \text{ relevant for } x_i\}, \text{ and} \\
X_i^- &= \{x_k : x_k \in X \text{ irrelevant for } x_i\}.
\end{aligned}
\]

If we embed each $x_i \in X$ into a Euclidean space, we can then rank the rest of the data
$X \setminus \{x_i\}$ by increasing distance from $x_i$. Truncating this ranked list at the top $r$
elements (i.e., closest $r$ points to $x_i$) will return a certain fraction of relevant points (true
positives), and irrelevant points (false positives). Averaging over all values of $r$ defines the
familiar AUC score, which can be compactly expressed as:

\[
\mathrm{AUC}(x_i \mid g) = \frac{1}{|X_i^+| \cdot |X_i^-|}
\sum_{(x_j, x_k) \in X_i^+ \times X_i^-}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| < \|g(x_i) - g(x_k)\| \right].
\]

Intuitively, AUC can be interpreted as an average over all pairs
$(x_j, x_k) \in X_i^+ \times X_i^-$ of the number of times $x_i$ was mapped closer to a relevant
point $x_j$ than an irrelevant point $x_k$. This in turn can be conveniently expressed by a set of
relative comparisons for each $x_i \in X$:

\[ \forall (x_j, x_k) \in X_i^+ \times X_i^- : \quad (i, j, i, k). \]

An embedding which satisfies a complete set of constraints of this form will receive an AUC score
of 1, since every relevant point must be closer to $x_i$ than every irrelevant point.
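
The pairwise form of AUC and the corresponding relative comparisons can be sketched in a few lines of Python. The embedding is assumed to be given as a dictionary from object index to coordinates; the names `auc` and `ranking_constraints` and the toy coordinates are illustrative only.

```python
import math


def auc(query, relevant, irrelevant, embedding):
    """Fraction of (relevant, irrelevant) pairs ranked correctly for a query."""
    def dist(a, b):
        return math.dist(embedding[a], embedding[b])

    hits = sum(dist(query, j) < dist(query, k)
               for j in relevant for k in irrelevant)
    return hits / (len(relevant) * len(irrelevant))


def ranking_constraints(query, relevant, irrelevant):
    """Relative comparisons (i, j, i, k): j should be closer to i than k is."""
    return [(query, j, query, k) for j in relevant for k in irrelevant]


# Toy embedding: point 0 is the query, 1 and 2 are relevant, 3 and 4 are not.
embedding = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0),
             3: (3.0, 0.0), 4: (0.0, 1.5)}
score = auc(0, [1, 2], [3, 4], embedding)       # 3 of the 4 pairs are ordered correctly
constraints = ranking_constraints(0, [1, 2], [3, 4])
```

Here the only misordered pair is (2, 4): the irrelevant point 4 lies at distance 1.5 from the query, closer than the relevant point 2 at distance 2.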

Now, returning to the more general setting, we do not assume binary relevance scores or complete
observations of relevance for all pairs of points. However, we can define the generalized AUC score
(GAUC) as simply the average number of correctly ordered pairs (equivalently, satisfied constraints)
given a set of relative comparisons:

\[
\mathrm{GAUC}(g) = \frac{1}{|C|} \sum_{(i,j,k,\ell) \in C}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| < \|g(x_k) - g(x_\ell)\| \right].
\tag{18}
\]

Like AUC, GAUC is bounded between 0 and 1, and the two scores coincide exactly in the previously
described ranking problem. A corresponding loss function can be defined by reversing the order of
the inequality, i.e.,

\[
L_{\mathrm{GAUC}}(g) = \frac{1}{|C|} \sum_{(i,j,k,\ell) \in C}
\mathbb{1}\left[ \|g(x_i) - g(x_j)\| \ge \|g(x_k) - g(x_\ell)\| \right].
\]
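
Equation 18 and the associated loss translate directly into code. The sketch below uses only the Python standard library; the function names and the one-dimensional toy embedding are assumptions made for illustration. The three comparisons are deliberately cyclic, so no embedding can satisfy all of them.

```python
import math


def _dist(embedding, a, b):
    return math.dist(embedding[a], embedding[b])


def gauc(constraints, embedding):
    """Fraction of comparisons (i, j, k, l) with d(i, j) < d(k, l)."""
    satisfied = sum(_dist(embedding, i, j) < _dist(embedding, k, l)
                    for i, j, k, l in constraints)
    return satisfied / len(constraints)


def gauc_loss(constraints, embedding):
    """Fraction of comparisons that are violated: d(i, j) >= d(k, l)."""
    violated = sum(_dist(embedding, i, j) >= _dist(embedding, k, l)
                   for i, j, k, l in constraints)
    return violated / len(constraints)


# Four points on a line; pairwise gaps are 1, 2 and 4.
embedding = {0: (0.0,), 1: (1.0,), 2: (3.0,), 3: (7.0,)}
constraints = [(0, 1, 1, 2), (1, 2, 2, 3), (2, 3, 0, 1)]
score = gauc(constraints, embedding)       # two of the three comparisons hold
loss = gauc_loss(constraints, embedding)   # the remaining one is violated
```

Since each comparison is either satisfied or violated, the two quantities always sum to 1.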

Note that $L_{\mathrm{GAUC}}$ takes the form of a sum over indicators, and can be interpreted as the
average 0/1-loss over $C$. This function is clearly not convex in $g$, and is therefore difficult to
optimize. Algorithms 2, 3 and 4 instead optimize a convex upper bound on $L_{\mathrm{GAUC}}$ by
replacing indicators by the hinge loss:

\[ h(x) = \max(0, x). \]

As in SVM, this is accomplished by introducing a unit margin and slack variable $\xi_{ijk\ell}$ for
each $(i, j, k, \ell) \in C$, and minimizing $\frac{1}{|C|} \sum_{C} \xi_{ijk\ell}$.
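
To see why the hinge loss with a unit margin upper-bounds the 0/1-loss, it can help to evaluate both on a few pairs of distances. The sketch below is illustrative: the function names are hypothetical, and each tuple in `pairs` holds the two distances `(d_ij, d_kl)` of one comparison, where the comparison asks for `d_ij < d_kl`.

```python
def zero_one_loss(d_ij, d_kl):
    """1 if the comparison d_ij < d_kl is violated, else 0."""
    return 1.0 if d_ij >= d_kl else 0.0


def hinge_loss(d_ij, d_kl, margin=1.0):
    """h(margin + d_ij - d_kl) with h(x) = max(0, x)."""
    return max(0.0, margin + d_ij - d_kl)


def mean_loss(loss, distance_pairs):
    """Average a per-comparison loss over a list of (d_ij, d_kl) pairs."""
    return sum(loss(a, b) for a, b in distance_pairs) / len(distance_pairs)


# Satisfied with margin, satisfied without margin, and violated.
pairs = [(1.0, 4.0), (2.0, 2.5), (3.0, 1.0)]
avg_zero_one = mean_loss(zero_one_loss, pairs)
avg_hinge = mean_loss(hinge_loss, pairs)
```

Whenever a comparison is violated, `d_ij - d_kl` is non-negative and the hinge value is at least 1; otherwise the hinge value is at least 0. The hinge average therefore never falls below the 0/1 average.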